Atlas expedition

LLM Eval Tooling Landscape 2026: A Consultancy's Decision Guide

Decision-grade 2026 comparison of nine LLM eval tools across license, self-host, CI/CD, judge, agent/RAG, pricing, stars and lock-in — with pick-X-if calls for a client-shipping consultancy.

41 sources · ~9 min read · Tags: llm-evals, ai-tooling, devtools, ci-cd, rag, agents, consultancy

Decision (consultancy shipping AI for clients). No single tool wins — build a two-layer stack: a code-first OSS eval framework that gates CI, plus an OSS platform you can self-host or run as SaaS per client.[1][37]

  • Default pick → DeepEval ⭐ 16k + Langfuse ⭐ 28k. Apache-2.0 pytest gating with the deepest metric set, paired with MIT self-hostable tracing/datasets you can stand up per-client with zero license cost.[10][22][39]
  • Pick Braintrust if the client will pay for a polished managed UI and stakeholder dashboards and you don’t want to run infra. It is the strongest commercial experiment/annotation platform: $249/mo Pro, self-host only on Enterprise.[7][8][28]
  • Pick LangSmith if the client’s app is already LangChain/LangGraph — native, lowest-cost commercial entry ($39/seat) — but accept ecosystem lock-in.[12][36]
  • Pick Promptfoo ⭐ 22k for red-teaming / security (now OpenAI-owned, still MIT) and Inspect AI ⭐ 2.2k for safety/regulated, multi-provider capability evals.[2][4][17]
  • Add Ragas ⭐ 14k when the deliverable is RAG and you want reference-free retrieval metrics.[15][16]

For expert consultants the load-bearing insight isn’t the tool — it’s that harness design swings published benchmark scores 10–20 points on SWE-Bench Verified, so the real asset is your client’s golden dataset + CI gates, and tools are interchangeable plumbing around it.[5][38]
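To make the harness-sensitivity point concrete, here is a toy sketch (not taken from any real benchmark) that scores the same four model outputs with two different scoring rules. The dataset, the function names and both scorers are hypothetical, and the gap is deliberately exaggerated by the tiny sample; the point is only that the scorer, not the model, moves the number.

```python
import re

# Hypothetical golden rows: the model outputs are identical for both harnesses.
GOLDEN = [
    {"expected": "42", "output": "42"},
    {"expected": "Paris", "output": "paris."},
    {"expected": "4", "output": "The answer is 4"},
    {"expected": "blue", "output": "red"},
]


def strict_match(expected: str, output: str) -> bool:
    """Harness A (assumed): exact string equality."""
    return expected == output


def lenient_match(expected: str, output: str) -> bool:
    """Harness B (assumed): lowercase, strip punctuation, accept the answer as any token."""
    tokens = re.findall(r"[a-z0-9]+", output.lower())
    return expected.lower() in tokens


def pass_rate(scorer) -> float:
    """Fraction of golden rows the given scorer accepts."""
    hits = sum(scorer(row["expected"], row["output"]) for row in GOLDEN)
    return hits / len(GOLDEN)


strict = pass_rate(strict_match)
lenient = pass_rate(lenient_match)
print(f"strict={strict:.0%} lenient={lenient:.0%}")
```

Pinning the scorer alongside the golden dataset in the client's repository is what keeps scores comparable when the surrounding tool is swapped.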


The 2026 shape of the market

Four developments reshaped the landscape this cycle, two of them acquisitions:

  • OpenAI acquired Promptfoo (announced 9 Mar 2026, ~$86M) — used by 125k devs and 25%+ of the Fortune 500. Red-teaming folds into OpenAI Frontier; the repo stays MIT and open.[2][3][33] For a consultancy shipping non-OpenAI models, the open caveat is vendor objectivity — pair it with an independent scorer (Inspect AI) if that matters to a client.[34]
  • Humanloop is gone — acqui-hired by Anthropic (founders + ~12 staff, Aug 2025, no IP/assets), standalone platform sunset 8 Sep 2025, tech folded into the Anthropic Console. Don’t design a 2026 stack around it.[26][27]
  • Two OpenAI “Evals” exist and are easy to conflate: the OSS openai/evals framework (MIT, ⭐ 19k, benchmark registry, still developed) versus the hosted Evals platform/API, which OpenAI has announced for deprecation alongside Agent Builder. Treat the hosted product as a dead end; the OSS repo is a benchmark-running tool, not a product eval platform.[13][14]
  • Braintrust raised ~$80M Series B → ~$800M valuation (Feb 2026) — the commercial eval-platform tier is consolidating around it and LangSmith.[40]

Master comparison

Stars are current as of June 2026 (GitHub API). ✓ = first-class/native, ~ = possible but not the tool’s focus, ✗ = not supported.

Tool | ⭐ Stars | OSS / License | Self-host vs SaaS | CI/CD fit | Dataset / experiment mgmt | LLM-as-judge | Agent eval | RAG eval | Pricing | Lock-in
DeepEval | ⭐ 16k | OSS · Apache-2.0 | Self-host (free); SaaS = Confident AI | ✓ native deepeval test run (pytest) | ~ via Confident AI cloud | ✓ G-Eval, DAG | ✓ task completion, tool correctness | ✓ faithfulness, ctx recall/precision | Framework free; Confident AI $19.99–49.99/user; self-host @ Team/Ent | Low
Promptfoo | ⭐ 22k | OSS · MIT (OpenAI-owned) | Self-host / CLI; Enterprise SaaS | ✓ native GitHub Action (PR review) | ~ YAML cases, no rich UI | ✓ custom assertions | ~ red-team focus | ~ not primary | Free OSS; Enterprise on-prem | Low ⚠ vendor objectivity
Braintrust | n/a (closed) | Commercial · proprietary | SaaS; self-host = Enterprise only | ✓ GitHub Action → SDK | ✓ best-in-class UI, diffs, annotation | ✓ sandboxed-Python scorers | ✓ lifecycle metrics | ~ via custom scorer | Free → Pro $249/mo → Ent | Medium
LangSmith | n/a (closed) | Commercial · proprietary | SaaS; self-host = Enterprise add-on | ✓ native evaluator runs | ✓ trace-curated datasets | ✓ online + offline | ✓ native LangGraph | ~ general framework | Free 5k → Plus $39/seat → Ent | High (LangChain)
Ragas | ⭐ 14k | OSS · Apache-2.0 | Self-host (library) | ~ wrap in pytest/CI | ✗ (metrics lib) | ~ judge-backed metrics | ~ limited | purpose-built, reference-free | Free | Low
Inspect AI | ⭐ 2.2k | OSS · MIT (UK AISI) | Self-host (library + UI) | ~ bring-your-own GitHub Actions | ✓ Dataset→Task→Solver→Scorer | ✓ model-graded + custom | ✓ Docker-sandboxed agentic | ~ capability-level | Free | Low
OpenAI Evals | ⭐ 19k | OSS · MIT | Self-host only | ~ no runner; wrap in Actions | ✓ benchmark registry | ✓ model-graded YAML | ~ Completion-Fn protocol |  | Free (hosted platform deprecating) | Low
Arize Phoenix | ⭐ 10k | OSS · Elastic 2.0 | Self-host (free); SaaS = Arize AX | ✓ code evaluators, LLM jury | ✓ score traces in-UI | ✓ Code Evaluators, LLM jury | ✓ OTel agent spans | ✓ embedding-based | OSS free; AX Free → AX Pro $50/mo | Low (OTel-native)
Langfuse | ⭐ 28k | OSS · MIT | Self-host free, unlimited; SaaS | ✓ SDK in CI; datasets/runs | ✓ datasets, runs, scores | ✓ LLM-as-judge templates | ✓ trace/span eval | ✓ trace eval | Self-host free; cloud $29 / $199 / $2,499 | Low

Sources backing the cells above: DeepEval[9][10][11][30]; Promptfoo[2][4][6][28]; Braintrust[5][7][8]; LangSmith[12][35][36]; Ragas[15][16]; Inspect AI[17][18]; OpenAI Evals[13][14]; Phoenix[19][20][21][31]; Langfuse[22][23][29].

Two axes that actually decide it

Axis 1 — framework (CI gate) vs platform (UI + storage). DeepEval, Promptfoo, Ragas, Inspect AI and OpenAI Evals are frameworks: they run in code, exit non-zero on a failed assertion, and gate a PR. Braintrust, LangSmith, Langfuse and Phoenix are platforms: persistent storage, dataset curation, annotation UI, production monitoring. The 2026 consensus is you want one of each, not one tool doing both badly.[1][37]
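The split between the two layers can be sketched in a few lines of plain Python. This is a tool-agnostic illustration, not the API of DeepEval, Langfuse or any other product named here: `GoldenCase`, `run_gate`, `fake_app`, `exact_score` and the `sink` callback are all hypothetical names. The gate returns a process exit code (the framework role), while the optional sink receives every scored row for storage (the platform role).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    expected: str


def run_gate(
    cases: list,
    app: Callable[[str], str],
    scorer: Callable[[str, str], float],
    threshold: float,
    sink: Optional[Callable[[dict], None]] = None,
) -> int:
    """Score every case; return 0 if all meet the threshold, else 1."""
    failed = []
    for case in cases:
        output = app(case.prompt)
        score = scorer(case.expected, output)
        if sink is not None:
            # Platform role: persist the scored row somewhere durable.
            sink({"case_id": case.case_id, "output": output, "score": score})
        if score < threshold:
            failed.append(case.case_id)
    for case_id in failed:
        print(f"FAIL {case_id}")
    return 1 if failed else 0


def fake_app(prompt: str) -> str:
    """Stand-in for the client's LLM application (assumption)."""
    return {"capital of France?": "Paris"}.get(prompt, "I don't know")


def exact_score(expected: str, output: str) -> float:
    """Stand-in scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == output.strip().lower() else 0.0


cases = [
    GoldenCase("geo-1", "capital of France?", "Paris"),
    GoldenCase("geo-2", "capital of Peru?", "Lima"),
]
stored_rows = []  # an in-memory list stands in for a platform's storage
exit_code = run_gate(cases, fake_app, exact_score, threshold=1.0, sink=stored_rows.append)
print("exit code:", exit_code)
```

In a CI job the script would end by passing `exit_code` to `sys.exit`, so a failing case blocks the PR; the sink would typically be replaced by a call into whichever platform the client uses.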

Axis 2 — CI gating strength. DeepEval (deepeval test run) and Promptfoo (PR-review GitHub Action) are the only two with native, first-class CI gating; everyone else gates by wrapping their SDK/API in your own GitHub Actions glue. For a consultancy whose value proposition is “we ship tested AI,” that native gate is worth optimising for.[41][18]

Self-host & lock-in (the consultancy-critical column)

A consultancy hands deliverables to clients with varying data-residency rules. Self-host capability and license terms dominate.

Tool | Free self-host? | License gotcha | Verdict for client work
Langfuse | ✓ unlimited, MIT | none | Best: drop on client infra, no fees[22][23]
DeepEval | ✓ framework; SaaS UI @ Team/Ent | Apache-2.0 | Best: pure-code, runs anywhere[9][10]
Inspect AI / Promptfoo / Ragas | ✓ | MIT / MIT / Apache-2.0 | Best: libraries, zero lock-in[15][17][4]
Phoenix | ✓ no feature gates, air-gappable | Elastic 2.0: can’t resell it as a hosted service | Fine for internal/client deploys; ⚠ can’t white-label as your own SaaS[20][21]
Braintrust | ✗ Enterprise-only | proprietary | Locked to their cloud unless client buys Enterprise[7][8]
LangSmith | ✗ Enterprise add-on | proprietary | Enterprise custom contract; + LangChain ecosystem pull[12][35][36]

Note the ELv2 trap for Phoenix: it is free to self-host for any client internally, but you may not offer it back as a managed/hosted service, which matters if your consultancy’s product is a hosted eval dashboard.[20] Langfuse’s MIT license has no such restriction, and self-hosted Langfuse runs at ~$1/GB-month vs Braintrust’s ~$3/GB, so it is cheaper at scale.[29]
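As a rough worked example of the storage figures quoted above, the sketch below multiplies them out for a few data volumes. It assumes both rates apply per GB per month (the text gives Braintrust's figure only as "per GB") and ignores seat fees, compute and ops time, so treat it as back-of-envelope arithmetic rather than a quote. The constant and function names are illustrative.

```python
# Rates taken from the text above; the per-month unit for Braintrust is an assumption.
LANGFUSE_SELF_HOST_USD_PER_GB = 1.0
BRAINTRUST_USD_PER_GB = 3.0


def monthly_storage_cost(gigabytes: float, usd_per_gb: float) -> float:
    """Storage-only cost for one month at a flat per-GB rate."""
    return gigabytes * usd_per_gb


for gb in (100, 1_000, 10_000):
    langfuse = monthly_storage_cost(gb, LANGFUSE_SELF_HOST_USD_PER_GB)
    braintrust = monthly_storage_cost(gb, BRAINTRUST_USD_PER_GB)
    print(f"{gb:>6} GB: Langfuse ~${langfuse:,.0f}, Braintrust ~${braintrust:,.0f}, "
          f"difference ~${braintrust - langfuse:,.0f}")
```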

Worth-a-mention tier

  • Langfuse ⭐ 28k — the most-starred OSS LLMOps platform; arguably the default open observability+eval+prompt-management layer for 2026, and the strongest Braintrust alternative when self-hosting matters.[22][32] Promoted into the main table above on merit.
  • Evidently AI ⭐ 7.6k (Apache-2.0) — open-source eval/monitoring building blocks, not a turnkey review platform; reaching a dedicated tool’s maturity takes real engineering investment. Pick only if you want to assemble your own.[24]
  • Patronus AI — evaluation-first, differentiated by proprietary judge models: Lynx (hallucination), GLIDER (rubric scoring), Percival (agent monitoring). Niche but real when you want managed, research-grade judges instead of rolling your own.[25]
  • Humanloop — dead. Sunset 8 Sep 2025 after Anthropic acqui-hire; listed only so you don’t propose it.[26][27]

Pick-X-if (recommendation grid for a client-shipping consultancy)

If the client situation is… | Pick | Why
Default / greenfield, data-residency varies | DeepEval + Langfuse | Apache/MIT, self-host anywhere, native pytest gate + free unlimited platform[10][22][39]
Client wants a managed, polished UI; budget OK | Braintrust Pro | Best experiment/annotation UX, diffs, sandboxed scorers, $249/mo[7][28]
App is LangChain/LangGraph | LangSmith | Native tracing/eval, $39/seat (lowest commercial entry); accept lock-in[12][36]
Security / red-team / regulated | Promptfoo (+Inspect AI) | MIT red-teaming; add Inspect AI for vendor-independent scoring[2][17][34]
Public-sector / safety / multi-provider capability evals | Inspect AI | UK AISI, MIT, 200+ evals across 10+ providers, Docker sandboxing[17][18]
Deliverable is RAG | Ragas (+DeepEval) | Reference-free retrieval metrics; DeepEval for agent/safety overlap[15][16]
Already OTel-instrumented | Arize Phoenix | OTel-native, score traces in-UI, no SDK lock-in (mind ELv2)[19][31]
Want managed judge models, not DIY | Patronus AI | Proprietary Lynx/GLIDER/Percival judges[25]
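For teams that want this grid in a scoping checklist or intake form, it can be encoded as a simple lookup. The situation keys and the `recommend` helper below are hypothetical labels invented for this sketch; only the picks themselves come from the grid.

```python
# Picks copied from the grid above; the dictionary keys are made-up shorthand.
PICKS = {
    "greenfield": "DeepEval + Langfuse",
    "managed_ui": "Braintrust Pro",
    "langchain": "LangSmith",
    "security": "Promptfoo (+Inspect AI)",
    "public_sector": "Inspect AI",
    "rag": "Ragas (+DeepEval)",
    "otel": "Arize Phoenix",
    "managed_judges": "Patronus AI",
}


def recommend(situations: list) -> list:
    """Return the picks for the given situation keys, in order, without duplicates.

    Unknown or empty input falls back to the default greenfield stack.
    """
    picks = []
    for key in situations:
        pick = PICKS.get(key)
        if pick is not None and pick not in picks:
            picks.append(pick)
    return picks or [PICKS["greenfield"]]


print(recommend(["rag", "otel"]))
print(recommend([]))
```

A client engagement often matches more than one row, which is why the helper returns a list rather than a single tool.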

Workshop framing (2h hands-on): demo the DeepEval pytest gate failing a PR on a regressed golden-dataset case, then push those same traces to a self-hosted Langfuse for the dashboard view — that one flow shows both axes and is reproducible on any client’s infra without a license.[1][38][41]
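The "regressed golden-dataset case" in that demo can be prototyped before any tool is installed. The sketch below compares per-case scores from a stored baseline run against a candidate run and lists the cases that dropped by more than a tolerance. The case IDs, scores, tolerance value and `find_regressions` helper are all invented for illustration; a real workshop would take the scores from the chosen framework's output.

```python
def find_regressions(
    baseline: dict,
    candidate: dict,
    tolerance: float = 0.05,
) -> list:
    """Return case IDs whose candidate score fell more than `tolerance` below baseline.

    A case that is missing from the candidate run also counts as a regression.
    """
    regressed = []
    for case_id, old_score in sorted(baseline.items()):
        new_score = candidate.get(case_id)
        if new_score is None or new_score < old_score - tolerance:
            regressed.append(case_id)
    return regressed


# Hypothetical per-case scores from the main branch and from the PR branch.
baseline = {"refund-policy": 0.92, "order-status": 0.88, "escalation": 0.95}
candidate = {"refund-policy": 0.93, "order-status": 0.61, "escalation": 0.91}

regressions = find_regressions(baseline, candidate)
print("regressed cases:", regressions)
```

Comparing against a baseline rather than a fixed threshold keeps the gate meaningful for cases that have always scored low, at the cost of having to store the baseline run somewhere.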
