Decision (consultancy shipping AI for clients). No single tool wins — build a two-layer stack: a code-first OSS eval framework that gates CI, plus an OSS platform you can self-host or run as SaaS per client.[1][37]
- Default pick → DeepEval ⭐ 16k + Langfuse ⭐ 28k. Apache-2.0 pytest gating with the deepest metric set, paired with MIT self-hostable tracing/datasets you can stand up per-client with zero license cost.[10][22][39]
- Pick Braintrust if clients will pay for a polished managed UI and stakeholder dashboards and you don’t want to run infra: it is the strongest commercial experiment/annotation platform, at $249/mo for Pro, with self-hosting only on Enterprise.[7][8][28]
- Pick LangSmith if the client’s app is already LangChain/LangGraph — native, lowest-cost commercial entry ($39/seat) — but accept ecosystem lock-in.[12][36]
- Pick Promptfoo ⭐ 22k for red-teaming / security (now OpenAI-owned, still MIT) and Inspect AI ⭐ 2.2k for safety/regulated, multi-provider capability evals.[2][4][17]
- Add Ragas ⭐ 14k when the deliverable is RAG and you want reference-free retrieval metrics.[15][16]
For expert consultants the load-bearing insight isn’t the tool — it’s that harness design swings published benchmark scores 10–20 points on SWE-Bench Verified, so the real asset is your client’s golden dataset + CI gates, and tools are interchangeable plumbing around it.[5][38]
The 2026 shape of the market
Two consolidation events reshaped the landscape this cycle, alongside a deprecation and a funding round worth tracking:
- OpenAI acquired Promptfoo (announced 9 Mar 2026, ~$86M) — used by 125k devs and 25%+ of the Fortune 500. Red-teaming folds into OpenAI Frontier; the repo stays MIT and open.[2][3][33] For a consultancy shipping non-OpenAI models, the open caveat is vendor objectivity — pair it with an independent scorer (Inspect AI) if that matters to a client.[34]
- Humanloop is gone — acqui-hired by Anthropic (founders + ~12 staff, Aug 2025, no IP/assets), standalone platform sunset 8 Sep 2025, tech folded into the Anthropic Console. Don’t design a 2026 stack around it.[26][27]
- Two OpenAI “Evals” exist and are easy to conflate: the OSS openai/evals framework (MIT, ⭐ 19k, benchmark registry, still developed) versus the hosted Evals platform/API, which OpenAI has announced for deprecation alongside Agent Builder. Treat the hosted product as a dead end; the OSS repo is a benchmark-running tool, not a product eval platform.[13][14]
- Braintrust raised ~$80M Series B → ~$800M valuation (Feb 2026): the commercial eval-platform tier is consolidating around it and LangSmith.[40]
Master comparison
Stars are current as of June 2026 (GitHub API). ✓ = first-class/native, ~ = possible but not the tool’s focus, ✗ = not supported.
| Tool | ⭐ Stars | OSS / License | Self-host vs SaaS | CI/CD fit | Dataset / experiment mgmt | LLM-as-judge | Agent eval | RAG eval | Pricing | Lock-in |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepEval | ⭐ 16k | OSS · Apache-2.0 | Self-host (free); SaaS = Confident AI | ✓ native deepeval test run (pytest) | ~ via Confident AI cloud | ✓ G-Eval, DAG | ✓ task completion, tool correctness | ✓ faithfulness, ctx recall/precision | Framework free; Confident AI $19.99–49.99/user; self-host @ Team/Ent | Low |
| Promptfoo | ⭐ 22k | OSS · MIT (OpenAI-owned) | Self-host / CLI; Enterprise SaaS | ✓ native GitHub Action (PR review) | ~ YAML cases, no rich UI | ✓ custom assertions | ~ red-team focus | ~ not primary | Free OSS; Enterprise on-prem | Low ⚠ vendor objectivity |
| Braintrust | — (closed) | Commercial · proprietary | SaaS; self-host = Enterprise only | ✓ GitHub Action → SDK | ✓ best-in-class UI, diffs, annotation | ✓ sandboxed-Python scorers | ✓ lifecycle metrics | ~ via custom scorer | Free → Pro $249/mo → Ent | Medium |
| LangSmith | — (closed) | Commercial · proprietary | SaaS; self-host = Enterprise add-on | ✓ native evaluator runs | ✓ trace-curated datasets | ✓ online + offline | ✓ native LangGraph | ~ general framework | Free 5k → Plus $39/seat → Ent | High (LangChain) |
| Ragas | ⭐ 14k | OSS · Apache-2.0 | Self-host (library) | ~ wrap in pytest/CI | ✗ (metrics lib) | ~ judge-backed metrics | ~ limited | ✓ purpose-built, reference-free | Free | Low |
| Inspect AI | ⭐ 2.2k | OSS · MIT (UK AISI) | Self-host (library + UI) | ~ bring-your-own GitHub Actions | ✓ Dataset→Task→Solver→Scorer | ✓ model-graded + custom | ✓ Docker-sandboxed agentic | ~ capability-level | Free | Low |
| OpenAI Evals | ⭐ 19k | OSS · MIT | Self-host only | ~ no runner; wrap in Actions | ✓ benchmark registry | ✓ model-graded YAML | ~ Completion-Fn protocol | ✗ | Free (hosted platform deprecating) | Low |
| Arize Phoenix | ⭐ 10k | OSS · Elastic 2.0 | Self-host (free); SaaS = Arize AX | ✓ code evaluators, LLM jury | ✓ score traces in-UI | ✓ Code Evaluators, LLM jury | ✓ OTel agent spans | ✓ embedding-based | OSS free; AX Free → AX Pro $50/mo | Low (OTel-native) |
| Langfuse | ⭐ 28k | OSS · MIT | Self-host free, unlimited; SaaS | ✓ SDK in CI; datasets/runs | ✓ datasets, runs, scores | ✓ LLM-as-judge templates | ✓ trace/span eval | ✓ trace eval | Self-host free; cloud $29 / $199 / $2,499 | Low |
Sources backing the cells above: DeepEval[9][10][11][30]; Promptfoo[2][4][6][28]; Braintrust[5][7][8]; LangSmith[12][35][36]; Ragas[15][16]; Inspect AI[17][18]; OpenAI Evals[13][14]; Phoenix[19][20][21][31]; Langfuse[22][23][29].
Two axes that actually decide it
Axis 1 — framework (CI gate) vs platform (UI + storage). DeepEval, Promptfoo, Ragas, Inspect AI and OpenAI Evals are frameworks: they run in code, exit non-zero on a failed assertion, and gate a PR. Braintrust, LangSmith, Langfuse and Phoenix are platforms: persistent storage, dataset curation, annotation UI, production monitoring. The 2026 consensus is you want one of each, not one tool doing both badly.[1][37]
Axis 2 — CI gating strength. DeepEval (deepeval test run) and Promptfoo (PR-review GitHub Action) are the only two with native, first-class CI gating; everyone else gates by wrapping their SDK/API in your own GitHub Actions glue. For a consultancy whose value proposition is “we ship tested AI,” that native gate is worth optimising for.[41][18]
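To make the framework-side contract concrete, here is a minimal, tool-agnostic sketch of a CI gate in plain Python. It is not DeepEval's or Promptfoo's API: `GoldenCase`, `keyword_score`, `run_gate`, `stub_model`, and the 0.8 threshold are hypothetical names and values chosen for illustration. It shows only the behaviour CI cares about, which is to score each golden case and return a non-zero exit code when any case falls below the threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenCase:
    """One row of a client's golden dataset (hypothetical schema)."""
    case_id: str
    prompt: str
    required_keywords: Tuple[str, ...]


def keyword_score(output: str, case: GoldenCase) -> float:
    """Toy scorer: fraction of required keywords found in the output.

    A real stack would swap in a framework metric or an LLM-as-judge here.
    """
    if not case.required_keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in case.required_keywords if kw.lower() in text)
    return hits / len(case.required_keywords)


def run_gate(
    cases: List[GoldenCase],
    generate: Callable[[str], str],
    threshold: float = 0.8,
) -> int:
    """Return 0 if every case scores >= threshold, else 1 (CI fails the PR)."""
    failures = []
    for case in cases:
        score = keyword_score(generate(case.prompt), case)
        if score < threshold:
            failures.append((case.case_id, score))
    for case_id, score in failures:
        print(f"FAIL {case_id}: score={score:.2f} < {threshold:.2f}")
    return 1 if failures else 0


def stub_model(prompt: str) -> str:
    """Stand-in for the client's application under test (assumption)."""
    return "Refunds are processed within 14 days."


# Hypothetical golden cases; the second one is deliberately unmet by the stub.
GOLDEN = [
    GoldenCase("refund-window", "How long do refunds take?", ("refund", "14 days")),
    GoldenCase("refund-channel", "How do I request a refund?", ("refund", "support portal")),
]

exit_code = run_gate(GOLDEN, stub_model)
```

In a pipeline, the script would end with `sys.exit(exit_code)` so a failed case blocks the merge. A native gate packages this loop, the reporting, and the exit code for you; the wrapped-SDK route means writing and maintaining something like the above per client.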
Self-host & lock-in (the consultancy-critical column)
A consultancy hands deliverables to clients with varying data-residency rules. Self-host capability and license terms dominate.
| Tool | Free self-host? | License gotcha | Verdict for client work |
|---|---|---|---|
| Langfuse | ✓ unlimited, MIT | none | Best — drop on client infra, no fees[22][23] |
| DeepEval | ✓ framework; SaaS UI @ Team/Ent | Apache-2.0 | Best — pure-code, runs anywhere[9][10] |
| Inspect AI / Promptfoo / Ragas | ✓ | MIT / MIT / Apache-2.0 | Best — libraries, zero lock-in[15][17][4] |
| Phoenix | ✓ no feature gates, air-gappable | Elastic 2.0 — can’t resell it as a hosted service | Fine for internal/client deploys; ⚠ can’t white-label as your own SaaS[20][21] |
| Braintrust | ✗ Enterprise-only | proprietary | Locked to their cloud unless client buys Enterprise[7][8] |
| LangSmith | ✗ Enterprise add-on | proprietary | Enterprise custom contract; + LangChain ecosystem pull[12][35][36] |
Note the ELv2 trap for Phoenix: free to self-host for any client internally, but you may not offer it back as a managed/hosted service — relevant if your consultancy’s product is a hosted eval dashboard.[20] Langfuse’s MIT has no such restriction, and self-hosts at ~$1/GB-month vs Braintrust’s ~$3/GB — cheaper at scale.[29]
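As a rough worked example of the storage gap, the sketch below applies the two quoted rates to a hypothetical retained volume. The 500 GB figure is invented for illustration, and treating the ~$3/GB figure as a monthly rate is an assumption, since the two figures are quoted in different units. It ignores compute, seats, and egress.

```python
def monthly_storage_cost(gb_retained: float, usd_per_gb: float) -> float:
    """Linear storage cost only; ignores compute, seats, and egress."""
    if gb_retained < 0 or usd_per_gb < 0:
        raise ValueError("volume and rate must be non-negative")
    return gb_retained * usd_per_gb


LANGFUSE_SELF_HOST_RATE = 1.0  # ~$1/GB-month, as quoted in the text
BRAINTRUST_RATE = 3.0          # ~$3/GB as quoted; assumed to be per month

volume_gb = 500.0  # hypothetical retained trace volume for one client

self_host = monthly_storage_cost(volume_gb, LANGFUSE_SELF_HOST_RATE)
managed = monthly_storage_cost(volume_gb, BRAINTRUST_RATE)
print(
    f"self-host: ${self_host:,.0f}/mo, "
    f"managed: ${managed:,.0f}/mo, "
    f"delta: ${managed - self_host:,.0f}/mo"
)
```

Under these assumptions the gap scales linearly with retained volume, so it matters mainly for clients with long retention windows or high trace throughput.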
Worth-a-mention tier
- Langfuse ⭐ 28k — the most-starred OSS LLMOps platform; arguably the default open observability+eval+prompt-management layer for 2026, and the strongest Braintrust alternative when self-hosting matters.[22][32] Promoted into the main table above on merit.
- Evidently AI ⭐ 7.6k (Apache-2.0) — open-source eval/monitoring building blocks, not a turnkey review platform; reaching a dedicated tool’s maturity takes real engineering investment. Pick only if you want to assemble your own.[24]
- Patronus AI — evaluation-first, differentiated by proprietary judge models: Lynx (hallucination), GLIDER (rubric scoring), Percival (agent monitoring). Niche but real when you want managed, research-grade judges instead of rolling your own.[25]
- Humanloop — dead. Sunset 8 Sep 2025 after Anthropic acqui-hire; listed only so you don’t propose it.[26][27]
Pick-X-if (recommendation grid for a client-shipping consultancy)
| If the client situation is… | Pick | Why |
|---|---|---|
| Default / greenfield, data-residency varies | DeepEval + Langfuse | Apache/MIT, self-host anywhere, native pytest gate + free unlimited platform[10][22][39] |
| Client wants a managed, polished UI; budget OK | Braintrust Pro | Best experiment/annotation UX, diffs, sandboxed scorers, $249/mo[7][28] |
| App is LangChain/LangGraph | LangSmith | Native tracing/eval, $39/seat — lowest commercial entry, accept lock-in[12][36] |
| Security / red-team / regulated | Promptfoo (+Inspect AI) | MIT red-teaming; add Inspect AI for vendor-independent scoring[2][17][34] |
| Public-sector / safety / multi-provider capability evals | Inspect AI | UK AISI, MIT, 200+ evals across 10+ providers, Docker sandboxing[17][18] |
| Deliverable is RAG | Ragas (+DeepEval) | Reference-free retrieval metrics; DeepEval for agent/safety overlap[15][16] |
| Already OTel-instrumented | Arize Phoenix | OTel-native, score traces in-UI, no SDK lock-in (mind ELv2)[19][31] |
| Want managed judge models, not DIY | Patronus AI | Proprietary Lynx/GLIDER/Percival judges[25] |
Workshop framing (2h hands-on): demo the DeepEval pytest gate failing a PR on a regressed golden-dataset case, then push those same traces to a self-hosted Langfuse for the dashboard view — that one flow shows both axes and is reproducible on any client’s infra without a license.[1][38][41]
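For the workshop, the "regressed golden-dataset case" can be shown with a small baseline-versus-candidate comparison. The sketch below is plain Python and does not use DeepEval's or Langfuse's API; `find_regressions`, the 0.05 tolerance, and the score dictionaries are hypothetical stand-ins for per-case scores exported from two eval runs.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return case ids whose score dropped by more than `tolerance`.

    A case missing from the candidate run also counts as a regression.
    """
    regressed = []
    for case_id, base_score in sorted(baseline.items()):
        new_score = candidate.get(case_id)
        if new_score is None or base_score - new_score > tolerance:
            regressed.append(case_id)
    return regressed


# Hypothetical per-case scores from the main branch and from the PR branch.
baseline_scores = {"refund-window": 0.95, "refund-channel": 0.90, "tone": 0.80}
candidate_scores = {"refund-window": 0.96, "refund-channel": 0.60, "tone": 0.78}

regressed = find_regressions(baseline_scores, candidate_scores)
print(f"regressed cases: {regressed}")
```

Wrapping the final check in a pytest test that asserts the list is empty turns this into the failing gate for the demo; the same per-case scores are what you would then attach to traces on the platform side.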