Wiring LLM Evals into CI/CD: gates, flaky judges, and cost budgets (2026)

Decision. Wire evals as a three-tier cascade that runs cheapest-first: deterministic checks (schema/regex/exact-match, sub-ms, $0) gate a fine-tuned classifier, which gates a frontier LLM judge that only fires on the residual ambiguous sample [6][1]. Block the PR only on deterministic failures and on statistically significant judge regressions (delta past a noise floor and p < 0.05); warn on everything inside the noise band [1]. Keep the PR budget in cents by scoping evals to changed routes and caching LLM calls; reserve the full multi-thousand-example sweep for a nightly job, because a naive judge-on-everything PR gate can burn $250–$2,500/run on a 5k-example set [14]. Tooling: Promptfoo for declarative prompt gates, DeepEval for pytest-native unit tests, Braintrust/LangSmith for hosted experiment diffs.

The core problem: a judge in CI is a flaky test by construction

An LLM-as-judge call costs 5–50¢ and takes 100ms–3s [14], and its verdict varies run-to-run. If you treat each CI run as standalone, “you can’t tell a noise-floor flake from a real regression” [1]. The 2026 consensus is not to pretend the judge is deterministic but to design for variance: control judge temperature, rerun borderline cases, ensemble only high-impact decisions, and smooth scores over time [3][8]. A 30-example set yields confidence intervals of roughly ±0.07 on a 0–1 scale, so a 2-point (0.02) drop sits inside the noise; you need 100–200 examples per route for a t-test to separate signal from variance [1].
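The ±0.07 figure can be sanity-checked with a normal-approximation confidence interval. The sketch below is illustrative only: the per-example standard deviation of 0.2 is an assumption chosen for the example, not a number from the cited sources, and `ci_half_width` is a hypothetical helper name.

```python
import math


def ci_half_width(std_dev: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for a mean score."""
    return z * std_dev / math.sqrt(n)


# ASSUMPTION: per-example judge scores on a 0-1 scale have a standard
# deviation of about 0.2. Measure this on your own data before relying on it.
ASSUMED_STD = 0.2

for n in (30, 100, 200):
    print(f"n={n:>3}: +/-{ci_half_width(ASSUMED_STD, n):.3f}")
```

Under that assumption the interval narrows with the square root of the sample size, which is why moving from 30 to 100–200 examples per route matters more than any single rerun.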

Pattern 1 — Deterministic gates the judge (the eval floor)

The cheapest, highest-signal layer is deterministic and runs first, so the judge never fires on structurally broken output [6].

Tier	Checks	Latency	Cost	Role
1 Deterministic	JSON schema, regex, exact-match, length contract, citation-presence, function-call arg validators	sub-ms	$0	Fail-fast floor; gates tiers 2–3
2 NLI classifier	faithfulness, claim_support, factual_consistency on CPU	10–50ms	~$0.001/call	Triage; routes ambiguous → tier 3
3 Frontier judge	open-ended rubrics (helpfulness, tone, correctness)	100ms–3s	5–50¢/call	Only on classifier disagreement

Sources: [6] [1] [14]

Function-call argument validation is “the single highest-signal deterministic check for agents” [6]. Fail-fast composition: if the response fails JSON schema, the judge does not run and the eval fails outright [6]. Deterministic checks alone catch 30–60% of production failures before any judge fires [6]. The reported effect: PR-gate cost drops from ~$27 (200 examples × 9 rubrics × $0.015) to $0.30–$0.60 [1]. ⚠ Don’t apply the cascade to subjective axes (helpfulness, brand voice) where the classifier has no ground truth [1].
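A minimal sketch of the fail-fast composition, assuming the classifier and judge are supplied as plain callables. The names `run_cascade`, `deterministic_check`, and `EvalResult`, and the 0.3/0.7 ambiguity band, are hypothetical choices for illustration and do not come from the cited sources.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class EvalResult:
    passed: bool
    tier: str    # which tier produced the verdict
    reason: str


def deterministic_check(raw: str, required_keys: Set[str]) -> Optional[str]:
    """Tier 1: return a failure reason, or None if the output is well-formed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return "top-level value is not an object"
    missing = required_keys - set(payload)
    if missing:
        return f"missing keys: {sorted(missing)}"
    return None


def run_cascade(
    raw: str,
    required_keys: Set[str],
    classifier: Callable[[str], float],  # tier 2: returns a 0-1 score
    judge: Callable[[str], bool],        # tier 3: expensive, called last
    low: float = 0.3,                    # ASSUMPTION: illustrative band edges
    high: float = 0.7,
) -> EvalResult:
    # Tier 1: fail outright on broken structure; later tiers never run.
    failure = deterministic_check(raw, required_keys)
    if failure is not None:
        return EvalResult(False, "deterministic", failure)

    # Tier 2: the classifier settles confident cases on its own.
    score = classifier(raw)
    if score >= high:
        return EvalResult(True, "classifier", f"score {score:.2f} >= {high}")
    if score <= low:
        return EvalResult(False, "classifier", f"score {score:.2f} <= {low}")

    # Tier 3: only the ambiguous residue reaches the frontier judge.
    return EvalResult(judge(raw), "judge", "ambiguous classifier score escalated")
```

In a real harness the tier-1 check would likely be a full JSON Schema or tool-argument validator; the key-presence check here only stands in for it.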

Pattern 2 — Statistical regression gates, not frozen thresholds

A single absolute threshold (“fail if score < 0.8”) flaps on judge noise. The 2026 pattern keeps a rolling baseline in git and gates on a statistical delta [1].

A nightly workflow runs the full suite (500–2,000 examples) and commits updated per-rubric mean arrays back to evals/baselines/*.json — calibrating against production variance, not “frozen fiction” [1].
The PR gate fires only when all three hold: mean dropped (delta < 0), change is significant (p < 0.05 via Welch’s t-test on per-example score arrays), and effect exceeds the noise floor (|delta| >= 0.03) [1]. Example failure: regression: delta=-0.045, p=0.018 [1].
Commit the rolling 7-day result JSON to git so a run can compare against recent history rather than a single noisy point [3][1].
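The three-condition gate can be sketched with the standard library alone. This is a simplification under stated assumptions: it computes Welch's statistic but takes the p-value from a normal approximation, which is reasonable only at the 100–200 examples per route recommended above. `scipy.stats.ttest_ind(baseline, candidate, equal_var=False)` would give the t-distribution p-value instead. The name `welch_gate` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import List, Tuple

NOISE_FLOOR = 0.03  # minimum |delta| that counts as a real effect [1]
ALPHA = 0.05        # significance level [1]


def welch_gate(baseline: List[float], candidate: List[float]) -> Tuple[bool, float, float]:
    """Return (block, delta, p) for per-example score arrays of one rubric."""
    delta = mean(candidate) - mean(baseline)
    # Welch standard error: the two variances are not pooled.
    std_err = math.sqrt(
        variance(baseline) / len(baseline) + variance(candidate) / len(candidate)
    )
    if std_err == 0.0:
        p_value = 1.0 if delta == 0.0 else 0.0
    else:
        # ASSUMPTION: large samples, so the normal CDF approximates the t CDF.
        t_stat = delta / std_err
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    block = delta < 0 and p_value < ALPHA and abs(delta) >= NOISE_FLOOR
    return block, delta, p_value
```

A caller would load the baseline array from the committed JSON, run the PR's examples to get the candidate array, and fail the job only when `block` is true, logging `delta` and `p` either way.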

An exit-code contract keeps CI policy stable against SDK log changes. Partition outcomes by code rather than grepping stdout [1]:

Code	Meaning	CI action
0	Pass	Merge allowed
2	Assertion failed	Hard-fail PR (block)
3	Warning (--strict)	Slack notify, don’t block
6	API error	Retry with backoff
7	Timeout	Increase `timeout-minutes`

Source: [1]
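One way to encode the table is a lookup that a CI wrapper script consults after the eval command exits. The enum and function names are hypothetical, and treating unknown codes as blocking is an assumption added here, not part of the cited contract.

```python
from enum import Enum


class CiAction(Enum):
    ALLOW = "merge allowed"
    BLOCK = "hard-fail PR"
    WARN = "notify, do not block"
    RETRY = "retry with backoff"
    RAISE_TIMEOUT = "increase timeout-minutes"


# Mirrors the exit-code table above.
EXIT_CODE_POLICY = {
    0: CiAction.ALLOW,
    2: CiAction.BLOCK,
    3: CiAction.WARN,
    6: CiAction.RETRY,
    7: CiAction.RAISE_TIMEOUT,
}


def action_for(exit_code: int) -> CiAction:
    """Map an eval-runner exit code to a CI action."""
    # ASSUMPTION: codes outside the contract fail closed (block the PR).
    return EXIT_CODE_POLICY.get(exit_code, CiAction.BLOCK)
```

Because the policy keys off integers only, a change in the runner's log format cannot silently flip a block into a pass.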

When to block vs warn

Situation	Verdict	Why
Deterministic check fails (schema, forbidden phrase, PII, bad tool args)	Block	Objective, zero false-positive, cheap [6]
Safety/red-team assertion fails	Block	Non-negotiable; Promptfoo runs red-team in CI [10]
Judge delta significant + past noise floor	Block	Real regression, not flake [1]
Judge delta inside noise band / single-run drop	Warn	Indistinguishable from judge variance [1]
Subjective axis (tone, helpfulness) small move	Warn	Smooth + monitor over time, don’t hard-gate [3]
API/timeout error in the eval harness	Retry, then warn	Infra noise ≠ quality regression [1]

Evals belong in CI and in production: run the same judge-based scorers in CI and against live traffic so version-to-version score comparisons stay meaningful [3]. Post-merge, route 1–5% of traffic to the new prompt (canary), score both populations with the same rubric library, and auto-rollback on a 2–3pp drift sustained over 15–60 min [1].
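The rollback rule can be read as "the canary trails the baseline by at least N points for at least M consecutive minutes". The sketch below assumes scores arrive as per-window means on a 0–1 scale; the `Window` structure and `should_rollback` name are hypothetical, and the defaults pick the low end of the 2–3pp and 15–60 min ranges quoted above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Window:
    minute: int            # minutes since the canary started
    baseline_score: float  # mean rubric score for the current prompt
    canary_score: float    # mean rubric score for the new prompt


def should_rollback(
    windows: Iterable[Window],
    drift_pp: float = 2.0,
    sustain_min: int = 15,
) -> bool:
    """True if the canary lags by >= drift_pp for sustain_min minutes in a row."""
    streak_start: Optional[int] = None
    for w in sorted(windows, key=lambda w: w.minute):
        gap_pp = (w.baseline_score - w.canary_score) * 100
        if gap_pp >= drift_pp:
            if streak_start is None:
                streak_start = w.minute
            if w.minute - streak_start >= sustain_min:
                return True
        else:
            # Any recovered window resets the streak, so a brief dip is ignored.
            streak_start = None
    return False
```

Requiring a sustained gap is what keeps a single noisy window from triggering a rollback, in the same spirit as the noise floor on the PR gate.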

Cost & latency budgets per CI run

Lever	Mechanism	Effect
Per-route scoping	Matrix-shard by changed paths; fall back to full sweep only if shared paths (`src/shared/`, `evals/rubrics/`) touched	Bounds PR cost; full 1,600-example sweep stays in nightly [1]
Cascade / sampling	Classifier first; judge on 30–60 ambiguous cases not 200	70–85% cost cut [1]
Caching LLM calls	Reuse request/output across runs; Promptfoo `~/.cache/promptfoo`, Braintrust proxy `use_proxy`	Re-run unchanged tests ~free [9][2]
Concurrency cancel	Cancel superseded PR runs; sub-5-min verdict target	Keeps feedback loop tight [1]
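Per-route scoping reduces to a path-to-route lookup with a shared-path escape hatch. In this sketch the shared prefixes come from the table above, while the route names and their directory layout are hypothetical placeholders for a real repository's structure.

```python
from typing import Iterable, List

# Shared paths from the table above: touching them forces a full sweep.
SHARED_PREFIXES = ("src/shared/", "evals/rubrics/")

# ASSUMPTION: hypothetical route layout, replace with your own mapping.
ROUTE_PREFIXES = {
    "search": "src/routes/search/",
    "summarize": "src/routes/summarize/",
    "support": "src/routes/support/",
}


def routes_to_eval(changed_files: Iterable[str]) -> List[str]:
    """Return the routes whose eval shards should run for this PR."""
    files = list(changed_files)
    if any(f.startswith(SHARED_PREFIXES) for f in files):
        return sorted(ROUTE_PREFIXES)  # shared code changed: run everything
    return sorted(
        route
        for route, prefix in ROUTE_PREFIXES.items()
        if any(f.startswith(prefix) for f in files)
    )
```

The returned list can feed a CI matrix so each affected route runs as its own shard, and an empty list means the eval job can be skipped.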

Reference figures: a tuned PR gate hit 4 min, $3.20, covering 247 judge calls across 9 rubrics on 200 examples (~$0.20 deterministic+NLI, ~$3.00 frontier subset); the nightly 2,000-example run amortizes at ~$20–$30 [1]. The anti-pattern to avoid: judge-on-every-case across a 5,000-example regression set runs $250–$2,500 per CI run [14]. On the latency side, a judge adds 1,000ms+ and can double response time, which is why the judge ships off the hot path as a span-attached eval rather than inline [14][6].
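The headline cost figures are simple products, which makes them easy to recompute for a different suite size. The helper below is a hypothetical name, and the assumption of one judge call per example for the 5,000-example sweep is inferred from the quoted 5–50¢ per-call range.

```python
def judge_cost(examples: int, rubrics: int, usd_per_call: float) -> float:
    """Cost of calling the judge once per example per rubric."""
    return examples * rubrics * usd_per_call


# Naive PR gate: 200 examples x 9 rubrics x $0.015 per call [1].
naive_pr_gate = judge_cost(200, 9, 0.015)

# Judge-on-everything sweep: 5,000 examples at 5-50 cents per call [14].
# ASSUMPTION: one judge call per example.
sweep_low = judge_cost(5000, 1, 0.05)
sweep_high = judge_cost(5000, 1, 0.50)

print(f"naive PR gate: ${naive_pr_gate:.2f}")
print(f"full sweep:    ${sweep_low:.0f} to ${sweep_high:.0f}")
```

Plugging in your own example count, rubric count, and per-call price gives a quick upper bound before any scoping, cascading, or caching is applied.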

Concrete tool wiring

Promptfoo — declarative prompt/red-team gate

promptfoo ⭐ 22k (Jun 2026) [10] ships promptfoo-action ⭐ 69 (Jun 2026) [7]. On a PR touching prompts/**, it runs a before/after eval and posts a comparison comment linking the web viewer [9]. Key inputs: github-token, prompts glob, config, cache-path [9]. Gating is either built-in (promptfoo eval --fail-on-error) or a jq quality-gate on the JSON output [4]:

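# Pass rate as a percentage of successes over all scored tests
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' results.json)
# Fail the job when the pass rate falls below 95%
if (( $(echo "$PASS_RATE < 95" | bc -l) )); then exit 1; fi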

Output results.junit.xml for native CI viewers; cache via PROMPTFOO_CACHE_PATH=~/.cache/promptfoo + PROMPTFOO_CACHE_TTL=86400 and actions/cache@v4 keyed on hashFiles('prompts/**','promptfooconfig.yaml') [4]. Assertions mix deterministic and llm-rubric (model-graded) types in one declarative config [4]. Used by OpenAI and Anthropic internally [10].
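The same pass-rate gate can be written in Python when `jq` and `bc` are not available on the runner. This sketch reads only the `results.stats.successes` and `results.stats.failures` fields that the jq expression above uses; the function names are hypothetical, and it adds an explicit guard for an empty result set, which the shell version does not have.

```python
import json
from pathlib import Path


def pass_rate(results: dict) -> float:
    """Percentage of successful tests in a parsed results.json payload."""
    stats = results["results"]["stats"]
    total = stats["successes"] + stats["failures"]
    if total == 0:
        # An empty run should not count as a pass.
        raise ValueError("no test results recorded")
    return 100.0 * stats["successes"] / total


def gate_exit_code(results: dict, threshold: float = 95.0) -> int:
    """Return 0 when the pass rate meets the threshold, else 1."""
    return 0 if pass_rate(results) >= threshold else 1


def gate_file(path: str = "results.json", threshold: float = 95.0) -> int:
    """Load the JSON output from disk and apply the gate."""
    return gate_exit_code(json.loads(Path(path).read_text()), threshold)
```

A CI step would end with `sys.exit(gate_file())` so that a pass rate below the threshold fails the job.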

DeepEval — pytest-native unit tests

deepeval ⭐ 16k (Jun 2026) [11] wraps pytest: deepeval test run test_llm_app.py, with each test calling assert_test(test_case, metrics=[...]) where metrics carry a per-metric threshold [12]. The command adds flags on top of pytest: -n (parallel processes), -r (repeats — rerun for variance), -c (cache), -i (ignore errors) [12]. In GitHub Actions it runs under OPENAI_API_KEY (judge) with optional CONFIDENT_API_KEY to ship results to the Confident AI platform; a metric below threshold fails the test → fails the job → blocks the PR [12].

Braintrust — hosted experiment diff

braintrust eval-action ⭐ 15 (Jun 2026) runs braintrust eval and posts a PR comment with baseline-relative scores like 0.83 (+3pp), flagging 🟢 improvements / 🔴 regressions [2]. Inputs: api_key, runtime (node/python), package_manager, use_proxy (defaults true — sets OPENAI_BASE_URL to cache LLM calls), terminate_on_failure [2]. Requires pull-requests: write + contents: read permissions to comment [2]. Thresholds live in the eval code / project settings, not as action inputs; the action blocks merges when scores fall below them [8].

LangSmith — pytest/Vitest + hosted experiments

LangSmith’s evals-cicd sample runs pytest -m evaluator, calls client.aevaluate() against a hosted dataset with OpenEvals LLM-as-judge, and gates on a criteria dict like "correctness": ">=0.8" [13]. A failed threshold → pytest failure → job failure → blocked PR; sys.exit(1) in the eval script is all that’s needed to fail the Action [13]. CI runs surface in the LangSmith UI under Experiments (grouped by a ci-regression prefix) for side-by-side run-over-run diffs [13].
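To show how a criteria dict such as `"correctness": ">=0.8"` can drive the exit code, here is a standalone sketch. It does not reproduce the sample's own parsing or call any LangSmith API: `failed_criteria`, the operator table, and the shape of the `scores` mapping are assumptions made for illustration.

```python
import operator
import re
from typing import Dict, List

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Two-character operators are listed first so ">=" is not read as ">".
_RULE = re.compile(r"^\s*(>=|<=|==|>|<)\s*([0-9]*\.?[0-9]+)\s*$")


def failed_criteria(scores: Dict[str, float], criteria: Dict[str, str]) -> List[str]:
    """Return a human-readable message for every criterion that is not met."""
    failures = []
    for metric, rule in criteria.items():
        match = _RULE.match(rule)
        if match is None:
            raise ValueError(f"unparseable rule for {metric!r}: {rule!r}")
        compare = _OPS[match.group(1)]
        threshold = float(match.group(2))
        score = scores.get(metric)
        # A missing score is treated as a failure rather than skipped.
        if score is None or not compare(score, threshold):
            failures.append(f"{metric}: got {score}, need {rule}")
    return failures
```

An eval script would print the returned messages and call `sys.exit(1)` when the list is non-empty, which is enough to fail the job as described above.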

Tool pick

If you want…	Use	Notes
Declarative prompt/RAG/red-team gate, no SDK	Promptfoo ⭐ 22k	YAML config, `--fail-on-error`, built-in caching [10][9]
pytest-native unit tests in an existing suite	DeepEval ⭐ 16k	`assert_test` + per-metric thresholds, `-r` repeats [11][12]
Hosted experiment diffs + team dashboards	Braintrust	PR comment with ±pp deltas, proxy caching [2]
Already on LangChain/LangSmith tracing	LangSmith ⭐ 0.9k	`aevaluate()` + pytest, Experiments diff UI [13]

The LangSmith star count refers to langsmith-sdk ⭐ 917 (Jun 2026) [13].

Workshop-ready takeaways

Layer deterministic → classifier → judge; fail-fast so the judge never scores broken output [6].
Gate on a statistical delta vs a git-committed rolling baseline, not a frozen number; need 100–200 examples/route for the t-test [1].
Block deterministic + safety + significant judge regressions; warn on noise-band moves and subjective axes [1][3].
Budget with per-route scoping + cascade + LLM-call caching; keep PR cost in cents and push the full sweep to a nightly job [1].
Use a stable exit-code contract so CI policy survives SDK output churn [1].