Atlas survey

Wiring Evals into CI

How to wire LLM evaluations into CI: path-scoped triggers, a layered eval pyramid, statistical delta gates, and a tool comparison across DeepEval, Promptfoo, Langfuse, Braintrust, and Arize Phoenix.

TL;DR Most CI eval gates are smoke tests: tiny frozen datasets checked against absolute score floors, where noise and signal live in the same range. Fix it with path-scoped triggers so evals run only on relevant changes, a layered eval pyramid (deterministic → LLM-as-judge) so cheap checks run first, and statistical delta gates (mean drop + Welch’s t-test + effect size) instead of absolute floors. [1] [2]

Why most CI eval gates don’t work

The canonical failure mode: add a pytest step, call the LLM, assert average score > 0.7. This gate passes on catastrophic regressions because: [1] [3]

  • Too small: 10–20 examples — too few for signal to clear LLM judge variance
  • Wrong baseline: absolute floor without delta comparison — a silent 15% drop stays “passing”
  • Too broad: runs on every commit, so the team disables it to cut costs
  • Uncalibrated judge: Pearson r < 0.7 vs human labels → measuring noise, not quality [4]
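
To put a number on the "too small" point, here is a minimal sketch of how tightly a mean score is pinned down at different dataset sizes. The per-case standard deviation of 0.2 is an assumption chosen for illustration, not a figure from the sources, and the normal-approximation interval is only a rough guide at small n.

```python
import math


def ci_half_width(per_case_std: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width of a mean score over n_cases."""
    return z * per_case_std / math.sqrt(n_cases)


# Hypothetical per-case spread of judge scores on a 0-1 scale (assumption).
ASSUMED_STD = 0.2

for n in (15, 150):
    half_width = ci_half_width(ASSUMED_STD, n)
    print(f"n={n}: mean score known to within +/-{half_width:.3f}")
```

Under that assumption, a 15-case suite leaves roughly ±0.10 of uncertainty on the mean, so a real drop of that size can hide inside the noise; at 150 cases the band shrinks to about ±0.03.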

Model providers push silent base-model updates. Knowledge bases drift as they grow. Traffic distributions shift as real users find edge cases your test suite never imagined. [2] A pre-deploy snapshot gate misses all of this.

The two-layer architecture

Layer 1 — Pre-release gate runs on every relevant PR. Binary: pass or block. Uses your golden dataset. Catches regressions before merge. [2]

Layer 2 — Production monitoring runs the same eval suite continuously on 5–10% of live traffic. Catches drift between releases — the thing a pre-deploy gate cannot catch. [5]

Don’t conflate them. Layer 1 without Layer 2 gives false safety; Layer 2 without Layer 1 lets regressions reach users first.
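
One way to pick the Layer 2 sample is a deterministic hash of the request ID, so the same request is always in or out of the sample regardless of which worker sees it. This is a sketch under assumptions: `in_eval_sample` and the `request_id` field are hypothetical names, and the 5% default is the low end of the range quoted above.

```python
import hashlib


def in_eval_sample(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of requests for scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    sampled = sum(in_eval_sample(f"req-{i}") for i in range(10_000))
    print(f"sampled {sampled} of 10000 requests")
```

Hashing rather than calling a random number generator keeps the decision reproducible, which helps when you later need to explain why a given trace was or was not scored.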

Path-scoped triggers

Scope the workflow to paths that change model behaviour. [6] [7]

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'evals/**'
      - 'src/agent/**'
      - 'src/tools/**'

Add a concurrency group keyed to github.head_ref with cancel-in-progress: true so rapid successive pushes cancel stale runs rather than queuing them. [1]

The eval pyramid

Run cheaper layers first. A failing regex costs microseconds; a failing LLM judge costs tokens and minutes. Safety metrics block on any failure; quality metrics use delta gates. [4] [8]

| Layer         | Examples                                        | Cost   | Block on               |
|---------------|-------------------------------------------------|--------|------------------------|
| Deterministic | JSON schema, PII detection, banned phrases      | ~0     | Any failure            |
| Heuristic     | Semantic similarity, regex, length bounds       | Low    | Any failure            |
| LLM-as-judge  | Faithfulness, coherence, rubric scores          | Medium | Statistical delta gate |
| Human review  | Ambiguous cases, high-risk domains              | High   | On human flag          |
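
The ordering in the table can be sketched as a short-circuiting runner: cheap checks run first and the judge is only called when they all pass. Everything here is illustrative. The banned-phrase list, the length bound, and the `judge` callable are hypothetical stand-ins, not part of any tool discussed in this survey.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical banned-phrase list (assumption, for illustration only).
BANNED = re.compile(r"\b(guaranteed returns|as an ai language model)\b", re.IGNORECASE)


def check_json(output: str) -> Optional[str]:
    """Deterministic layer: the output must parse as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return "output is not valid JSON"
    return None


def check_banned(output: str) -> Optional[str]:
    """Deterministic layer: no banned phrases."""
    return "banned phrase found" if BANNED.search(output) else None


def check_length(output: str, max_chars: int = 2000) -> Optional[str]:
    """Heuristic layer: a simple length bound (assumed limit)."""
    return None if len(output) <= max_chars else "output exceeds length bound"


CHEAP_CHECKS = [check_json, check_banned, check_length]


def run_pyramid(output: str, judge: Callable[[str], float]) -> dict:
    """Run cheap checks in order; call the judge only if all of them pass."""
    for check in CHEAP_CHECKS:
        failure = check(output)
        if failure:
            return {"blocked": True, "reason": failure, "judge_score": None}
    # The judge score is not a per-case verdict; it feeds the delta gate.
    return {"blocked": False, "reason": None, "judge_score": judge(output)}
```

Usage: `run_pyramid('{"answer": "ok"}', judge=my_judge)` returns a judge score, while `run_pyramid("not json", judge=my_judge)` blocks without spending any judge tokens.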

Statistical delta gates

An absolute floor is wrong for LLM-as-judge metrics. Use a three-part test that fires only when all three conditions hold — this prevents judge noise from triggering false alarms: [1]

  1. Mean score dropped below the 7-day rolling baseline
  2. Welch’s t-test yields p < 0.05
  3. Effect size exceeds the noise floor (~0.03 on a 0–1 normalized scale)
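
The three conditions above can be sketched with the standard library alone. Two assumptions to flag: the test is one-sided (it only looks for a drop), and the p-value uses a normal approximation to the Welch t statistic, which is reasonable at 100+ cases per side but not at small n. For an exact p-value, `scipy.stats.ttest_ind(baseline, candidate, equal_var=False)` is the usual choice. The function name `delta_gate` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence

NOISE_FLOOR = 0.03  # from the text: ~0.03 on a 0-1 normalized scale
ALPHA = 0.05


def delta_gate(
    baseline: Sequence[float],
    candidate: Sequence[float],
    alpha: float = ALPHA,
    noise_floor: float = NOISE_FLOOR,
) -> bool:
    """Return True (block the PR) only when all three conditions hold."""
    drop = mean(baseline) - mean(candidate)

    # Welch standard error: per-group variances, no equal-variance assumption.
    se = math.sqrt(
        variance(baseline) / len(baseline) + variance(candidate) / len(candidate)
    )
    if se == 0:
        p_value = 0.0 if drop > 0 else 1.0
    else:
        t_stat = drop / se
        # One-sided p-value via normal approximation (assumption, see lead-in).
        p_value = 1.0 - NormalDist().cdf(t_stat)

    mean_dropped = drop > 0                   # 1. mean below baseline
    significant = p_value < alpha             # 2. Welch test, p < 0.05
    above_noise = drop > noise_floor          # 3. effect size clears noise floor
    return mean_dropped and significant and above_noise
```

Requiring all three is what keeps the gate quiet: a statistically significant drop of 0.01 is ignored because it sits under the noise floor, and a large but noisy drop on a handful of cases is ignored because it is not significant.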

For simpler setups, start with an absolute floor plus a delta check. Arize Phoenix’s experiments API uses experiment.get_evaluations()['score'].mean() > 0.8 as the gate and compares against the previous run. [18] Langfuse’s experiment-action fails the PR when the score misses a threshold measured against a versioned baseline, not an absolute floor. [9]

Threshold calibration: set thresholds so they fail on genuine regressions, not on noise — a threshold that fires constantly provides no signal, while one that never fires provides false confidence. [4]

Tool comparison

Five tools with production CI/CD support, June 2026. [17]

| Tool          | ⭐ Stars | CI integration          | PR comments  | Self-host  | Best for                          |
|---------------|---------|-------------------------|--------------|------------|-----------------------------------|
| DeepEval      | 16k     | pytest native           | via artifact | ✓ free     | pytest-native teams               |
| Promptfoo     | 22k     | promptfoo-action@v1     | ✓ native     | ✓ free     | YAML-first, red teaming           |
| Langfuse      | 28.7k   | experiment-action@v1    | ✓ native     | ✓ free     | Prompt management + evals         |
| Braintrust    | SaaS    | dedicated GitHub Action | ✓ native     | Enterprise | Experiment tracking, score diffs  |
| Arize Phoenix | 10k     | custom Python scripts   | custom       | ✓ free     | Observability-first               |

Star counts from GitHub API, June 2026. [10] [11] [12] [13]

DeepEval — pytest integration

assert_test() raises an assertion error when a metric falls below its threshold, blocking the CI run. Do not use evaluate() in CI — that collects results without failing the build. [3] [14]

- name: Run evals
  env:
    OPENAI_API_KEY: $
  run: poetry run deepeval test run tests/test_llm.py -n 4

The -n 4 flag parallelises across 4 workers. Add -r 2 to repeat each case twice; combined with a 1/2 minimum-pass assertion this tolerates one flaky-judge result per case.
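
The repeat-and-minimum-pass idea is tool-independent and can be sketched in a few lines. This is not how DeepEval implements repeats; `passes_with_repeats` and `run_case` are hypothetical names used only to show the logic of tolerating one flaky result out of two.

```python
from typing import Callable


def passes_with_repeats(
    run_case: Callable[[], bool], repeats: int = 2, min_pass: int = 1
) -> bool:
    """Run a case up to `repeats` times; pass if at least `min_pass` runs pass."""
    passes = 0
    for attempt in range(repeats):
        if run_case():
            passes += 1
        if passes >= min_pass:
            return True  # enough passes, skip the remaining runs
        remaining = repeats - attempt - 1
        if passes + remaining < min_pass:
            return False  # cannot reach min_pass any more, stop early
    return False
```

With `repeats=2, min_pass=1`, a single flaky judge verdict no longer fails the case, but two failures in a row still do. Stopping early in both directions avoids paying for judge calls that cannot change the outcome.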

Promptfoo — GitHub Action

- uses: promptfoo/promptfoo-action@v1     # ⭐ 69 [6] https://github.com/promptfoo/promptfoo-action
  with:
    openai-api-key:    $
    github-token:      $
    config:            promptfooconfig.yaml
    fail-on-threshold: 90           # block if suite pass rate < 90 %
    repeat:            3            # re-run each case 3×
    repeat-min-pass:   2            # pass if ≥ 2/3 succeed
    cache-path:        .promptfoo-cache

Posts a PR comment with pass/fail counts and a link to the web viewer. The cache-path integrates with GitHub Actions cache to skip redundant API calls between runs. [6]

Langfuse — experiment gate

- uses: langfuse/experiment-action@v1.0.0
  with:
    langfuse_public_key: $
    langfuse_secret_key: $
    langfuse_base_url:   https://cloud.langfuse.com
    github_token:        $
    experiment_path:     evals/run_experiment.py
    dataset_name:        golden-v2
    dataset_version:     "2026-05-15T00:00:00Z"

Blocks the PR when the experiment score drops below the threshold defined in your experiment script. Every run is tracked in Langfuse for audit trail. [9]

Dataset strategy

Quality beats quantity. Fifty expertly curated examples with clear acceptance criteria outperform 500 auto-generated ones. [5] Build the dataset from: [7]

  • Production logs — tag by difficulty and failure type
  • Domain-expert annotations — 25–50 human-labeled outcome cases first, before scaling; validate metric alignment at < 5% combined false positive/negative rate [15]
  • Known regressions — every production failure → permanent test case within 48 hours
  • Adversarial inputs — edge cases, jailbreaks, multilingual variants

Target 100–200 cases for a blocking gate. Version the dataset alongside prompts in Git so each PR references a specific snapshot. [4]
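
A minimal sketch of what a tagged, versionable golden case could look like. The field names and the fingerprint helper are assumptions made for illustration; the sources do not prescribe a schema. The fingerprint gives each PR a short identifier for the exact dataset snapshot it was evaluated against.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One entry in the golden dataset (hypothetical schema)."""
    case_id: str
    input: str
    expected: str
    source: str        # e.g. "production_log", "expert", "regression", "adversarial"
    difficulty: str    # e.g. "easy", "hard"
    failure_type: str = ""


def dataset_fingerprint(cases: Sequence[GoldenCase]) -> str:
    """Order-independent content hash identifying a dataset snapshot."""
    ordered = sorted(cases, key=lambda case: case.case_id)
    canonical = json.dumps(
        [asdict(case) for case in ordered], sort_keys=True, ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Logging the fingerprint next to each eval run makes score changes attributable: if the fingerprint differs between two runs, the dataset changed, not only the prompt.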

Judge calibration

Before deploying a judge at CI scale: validate it on your domain. Target Pearson r > 0.7 between judge scores and domain-expert verdicts — below that you’re measuring something, but it may not correlate with actual user experience. [4]
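
The calibration check itself is a few lines of arithmetic. The sketch below computes Pearson r by hand so it runs on any Python 3 version (`statistics.correlation` does the same on 3.10+). The scores in the example are made-up illustration data, and `judge_is_calibrated` is a hypothetical helper name.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def judge_is_calibrated(
    judge_scores: Sequence[float], expert_scores: Sequence[float], target: float = 0.7
) -> bool:
    """True when the judge clears the r > 0.7 target from the text."""
    return pearson_r(judge_scores, expert_scores) > target


if __name__ == "__main__":
    # Made-up illustration data: judge scores vs. expert pass/fail verdicts.
    judge = [0.9, 0.8, 0.4, 0.3, 0.7]
    expert = [1, 1, 0, 0, 1]
    print(f"r = {pearson_r(judge, expert):.2f}")
```

Run this on the expert-annotated subset before trusting the judge on the full dataset; a handful of cases gives a very wide interval on r, so treat a pass at small n as provisional.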

Start with a capable model (GPT-4o class). Annotate ~50 cases with real-world outcomes, not just expected scores. Once the judge consistently agrees with expert annotations → scale to full dataset runs. [16]

For structured agent output: G-Eval and DeepEval’s DAG metric work well. For RAG systems: Faithfulness + Answer Relevancy are the two non-negotiables: Faithfulness catches hallucinations, Answer Relevancy catches off-topic answers. [3]

Closing the loop: canary + production monitoring

Post-merge, route 1–5% of production traffic to the new version. Score both the canary and incumbent populations with the same rubric library. If rolling-mean drift exceeds the calibrated threshold → automatic rollback. [1]
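
The rollback rule can be sketched as a comparison of two rolling means. The window size and drift threshold below are placeholders to be calibrated on your own traffic, and `RollingMean` and `should_roll_back` are hypothetical names rather than part of any tool mentioned here.

```python
from collections import deque
from statistics import mean


class RollingMean:
    """Mean of the most recent `window` scores for one population."""

    def __init__(self, window: int) -> None:
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> None:
        self.scores.append(score)

    def is_full(self) -> bool:
        return len(self.scores) == self.scores.maxlen

    def value(self) -> float:
        return mean(self.scores)


def should_roll_back(
    incumbent: RollingMean, canary: RollingMean, drift_threshold: float
) -> bool:
    """Roll back when the canary's rolling mean trails the incumbent's by more
    than the threshold. Waits until both windows are full to avoid reacting
    to the first few scores."""
    if not (incumbent.is_full() and canary.is_full()):
        return False
    return incumbent.value() - canary.value() > drift_threshold
```

Both populations are scored with the same rubric, so the comparison is a like-for-like difference rather than a check against an absolute floor.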

Every production failure that reaches users becomes a permanent test case in the golden dataset within 48 hours. This is how the gate becomes progressively harder to fool — CI evals are not a checkpoint, they’re a flywheel. [5] [7]
