Wiring Evals into CI

TL;DR Most CI eval gates are smoke tests — tiny frozen datasets against absolute score floors where noise and signal live in the same range. Fix it with path-scoped triggers so evals run only on relevant changes, a layered eval pyramid (deterministic → LLM-as-judge) so cheap checks run first, and statistical delta gates (mean drop + Welch’s t + effect size) instead of absolute floors. [1] [2]

Why most CI eval gates don’t work

The canonical failure mode: add a pytest step, call the LLM, assert average score > 0.7. This gate passes on catastrophic regressions because: [1] [3]

Too small: 10–20 examples — too few for signal to clear LLM judge variance
Wrong baseline: absolute floor without delta comparison — a silent 15% drop stays “passing”
Too broad: runs on every commit, so the team disables it to cut costs
Uncalibrated judge: Pearson r < 0.7 vs human labels → measuring noise, not quality [4]
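A rough calculation shows why 10–20 examples cannot clear judge variance. The sketch below is illustrative only: it assumes a per-example judge score standard deviation of 0.15 on a 0–1 scale (a made-up figure, measure your own) and uses the normal approximation for a 95% interval. The function name `ci_half_width` is hypothetical.

```python
import math


def ci_half_width(score_std: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width of a mean score over n examples."""
    return z * score_std / math.sqrt(n)


# score_std=0.15 is an assumed, illustrative judge spread, not a measurement.
for n in (15, 50, 200):
    print(f"n={n:>3}: mean score known to about ±{ci_half_width(0.15, n):.3f}")
```

Under that assumption, 15 examples pin the mean down to roughly ±0.08, so a 0.05 regression is indistinguishable from noise, while 200 examples tighten the interval to about ±0.02.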

Model providers push silent base-model updates. Knowledge bases drift as they grow. Traffic distributions shift as real users find edge cases your test suite never imagined. [2] A pre-deploy snapshot gate misses all of this.

The two-layer architecture

Layer 1 — Pre-release gate runs on every relevant PR. Binary: pass or block. Uses your golden dataset. Catches regressions before merge. [2]

Layer 2 — Production monitoring runs the same eval suite continuously on 5–10% of live traffic. Catches drift between releases — the thing a pre-deploy gate cannot catch. [5]

Don’t conflate them. Layer 1 without Layer 2 gives false safety; Layer 2 without Layer 1 lets regressions reach users first.
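One way to pick the slice of live traffic that Layer 2 scores is deterministic hash-based sampling, so the same request is always in or out of the sample regardless of which worker sees it. This is a minimal sketch, not something the cited tools prescribe; `should_score` and the `request_id` format are hypothetical.

```python
import hashlib


def should_score(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of requests for evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the request if it falls in the lowest sample_rate fraction of buckets.
    return bucket < int(sample_rate * 2**64)


# Hypothetical usage inside a request handler:
if should_score("req-12345", sample_rate=0.05):
    pass  # enqueue the request/response pair for the eval suite
```

Sampling on a stable ID rather than a random draw keeps the sample reproducible, which helps when you later need to explain why a given trace was or was not scored.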

Path-scoped triggers

Scope the workflow to paths that change model behaviour. [6] [7]

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'evals/**'
      - 'src/agent/**'
      - 'src/tools/**'

Add a concurrency block whose group is keyed to github.head_ref and set cancel-in-progress: true, so rapid successive pushes cancel stale runs on the same branch instead of queuing behind them. [1]

The eval pyramid

Run cheaper layers first. A failing regex costs microseconds; a failing LLM judge costs tokens and minutes. Safety metrics block on any failure; quality metrics use delta gates. [4] [8]

Layer	Examples	Cost	Block on
Deterministic	JSON schema, PII detection, banned phrases	~0	Any failure
Heuristic	Semantic similarity, regex, length bounds	Low	Any failure
LLM-as-judge	Faithfulness, coherence, rubric scores	Medium	Statistical delta gate
Human review	Ambiguous cases, high-risk domains	High	On human flag
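The ordering in the table can be sketched as a short-circuit: run the deterministic layer first and only spend judge tokens when it passes. Everything named here is an assumption for illustration: the `answer` field, the banned-phrase list, and the `judge` callable stand in for whatever your schema and judge client actually look like.

```python
import json
import re
from typing import Callable

# Hypothetical banned-phrase list; replace with your own policy.
BANNED = re.compile(r"\b(guaranteed returns|as an ai language model)\b", re.IGNORECASE)


def deterministic_checks(output: str) -> list[str]:
    """Cheap, exact checks. Returns failure reasons; an empty list means pass."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    if not isinstance(payload, dict) or "answer" not in payload:
        failures.append("missing required 'answer' field")
    if BANNED.search(output):
        failures.append("banned phrase present")
    return failures


def evaluate(output: str, judge: Callable[[str], float]) -> dict:
    """Block on any deterministic failure; call the LLM judge only if all pass."""
    failures = deterministic_checks(output)
    if failures:
        return {"blocked": True, "reasons": failures, "judge_score": None}
    # The judge score is returned, not thresholded: it feeds the delta gate.
    return {"blocked": False, "reasons": [], "judge_score": judge(output)}
```

Note that the judge score is not compared to a floor here; quality scores are collected and handed to the statistical gate, while safety-style checks block immediately.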

Statistical delta gates

An absolute floor is wrong for LLM-as-judge metrics. Use a three-part test that fires only when all three conditions hold — this prevents judge noise from triggering false alarms: [1]

Mean score dropped below the 7-day rolling baseline
Welch’s t-test yields p < 0.05
Effect size exceeds the noise floor (~0.03 on a 0–1 normalized scale)
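The three conditions above can be sketched in standard-library Python. Two assumptions to flag: the effect size is taken as the raw mean drop on the 0–1 scale, and the p-value uses a normal approximation to Welch's statistic rather than the exact t distribution (reasonable at roughly 100+ cases per side; for the exact test, `scipy.stats.ttest_ind(baseline, candidate, equal_var=False)` is the usual choice). `should_block` is a hypothetical name.

```python
import math
from statistics import NormalDist, mean, variance


def should_block(baseline: list[float], candidate: list[float],
                 alpha: float = 0.05, noise_floor: float = 0.03) -> bool:
    """Block only if the mean dropped, the drop is significant, and it clears the noise floor."""
    drop = mean(baseline) - mean(candidate)
    if drop <= 0:                      # condition 1: the mean must have dropped
        return False
    # Welch's standard error: separate variances, no equal-variance assumption.
    se = math.sqrt(variance(baseline) / len(baseline)
                   + variance(candidate) / len(candidate))
    if se == 0:
        p_value = 0.0                  # no spread at all, so any drop is exact
    else:
        t_stat = drop / se
        # Two-sided p-value, normal approximation (assumption, see lead-in).
        p_value = 2 * (1 - NormalDist().cdf(t_stat))
    # conditions 2 and 3: statistically significant AND larger than judge noise
    return p_value < alpha and drop > noise_floor
```

Requiring all three means a tiny but statistically significant shift (large n, 0.01 drop) does not block, and neither does a large-looking drop on a handful of noisy examples.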

Simpler setups: if the full three-part test is more than you need at first, start with an absolute floor plus a delta check. Arize Phoenix’s experiments API uses experiment.get_evaluations()['score'].mean() > 0.8 as the gate and compares against the previous run. [18] Langfuse’s experiment-action fails the PR when the score misses a threshold measured against a versioned baseline, not an absolute floor. [9]

Threshold calibration: set thresholds so they fail on genuine regressions, not on noise — a threshold that fires constantly provides no signal, while one that never fires provides false confidence. [4]

Tool comparison

Five tools with production CI/CD support, June 2026. [17]

Tool	⭐ Stars	CI integration	PR comments	Self-host	Best for
DeepEval	16k	`pytest` native	via artifact	✓ free	pytest-native teams
Promptfoo	22k	`promptfoo-action@v1`	✓ native	✓ free	YAML-first, red teaming
Langfuse	28.7k	`experiment-action@v1`	✓ native	✓ free	Prompt management + evals
Braintrust	SaaS	dedicated GitHub Action	✓ native	Enterprise	Experiment tracking, score diffs
Arize Phoenix	10k	custom Python scripts	custom	✓ free	Observability-first

Star counts are from the GitHub API, June 2026. [10] [11] [12] [13]

DeepEval — pytest integration

assert_test() raises an assertion error when a metric falls below its threshold, blocking the CI run. Do not use evaluate() in CI — that collects results without failing the build. [3] [14]

- name: Run evals
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: poetry run deepeval test run tests/test_llm.py -n 4

The -n 4 flag parallelises the run across 4 workers. Add -r 2 to repeat each case twice; combined with a rule that at least 1 of the 2 runs must pass, this tolerates one flaky judge result per case.
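The repeat-and-minimum-pass idea is tool-independent. The sketch below shows the logic in plain Python; it is not DeepEval's implementation, and `passes_with_repeats` and `run_case` are hypothetical names for illustration.

```python
from typing import Callable


def passes_with_repeats(run_case: Callable[[], bool],
                        repeats: int = 2, min_pass: int = 1) -> bool:
    """Run a flaky check up to `repeats` times; pass if at least `min_pass` runs succeed."""
    if not 1 <= min_pass <= repeats:
        raise ValueError("min_pass must be between 1 and repeats")
    passed = 0
    for attempt in range(repeats):
        if run_case():
            passed += 1
        remaining = repeats - attempt - 1
        # Stop early once the outcome can no longer change, to save judge calls.
        if passed >= min_pass or passed + remaining < min_pass:
            break
    return passed >= min_pass
```

The early exit matters for cost: with 2-of-3 voting, two consecutive passes or two consecutive failures settle the case without a third judge call.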

Promptfoo — GitHub Action

- uses: promptfoo/promptfoo-action@v1     # ⭐ 69, see [6]: https://github.com/promptfoo/promptfoo-action
  with:
    openai-api-key:    ${{ secrets.OPENAI_API_KEY }}
    github-token:      ${{ secrets.GITHUB_TOKEN }}
    config:            promptfooconfig.yaml
    fail-on-threshold: 90           # block if suite pass rate < 90 %
    repeat:            3            # re-run each case 3×
    repeat-min-pass:   2            # pass if ≥ 2/3 succeed
    cache-path:        .promptfoo-cache

Posts a PR comment with pass/fail counts and a link to the web viewer. The cache-path integrates with GitHub Actions cache to skip redundant API calls between runs. [6]

Langfuse — experiment gate

- uses: langfuse/experiment-action@v1.0.0
  with:
    langfuse_public_key: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
    langfuse_secret_key: ${{ secrets.LANGFUSE_SECRET_KEY }}
    langfuse_base_url:   https://cloud.langfuse.com
    github_token:        ${{ secrets.GITHUB_TOKEN }}
    experiment_path:     evals/run_experiment.py
    dataset_name:        golden-v2
    dataset_version:     "2026-05-15T00:00:00Z"

Blocks the PR when the experiment score drops below the threshold defined in your experiment script. Every run is tracked in Langfuse, which gives you an audit trail. [9]

Dataset strategy

Quality beats quantity. Fifty expertly curated examples with clear acceptance criteria outperform 500 auto-generated ones. [5] Build the dataset from: [7]

Production logs — tag by difficulty and failure type
Domain-expert annotations — 25–50 human-labeled outcome cases first, before scaling; validate metric alignment at < 5% combined false positive/negative rate [15]
Known regressions — every production failure → permanent test case within 48 hours
Adversarial inputs — edge cases, jailbreaks, multilingual variants

Target 100–200 cases for a blocking gate. Version the dataset alongside prompts in Git so each PR references a specific snapshot. [4]
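A minimal sketch of how a Git-versioned golden dataset might be stored and pinned: one JSON object per line, tagged by source, with a short content hash a PR can reference. The `GoldenCase` fields, the JSONL layout, and both function names are assumptions for illustration, not a format any of the tools above require.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class GoldenCase:
    case_id: str
    input: str
    expected: str
    source: str  # e.g. "production_log", "expert", "regression", "adversarial"
    tags: list[str] = field(default_factory=list)  # difficulty, failure type, ...


def append_case(path: Path, case: GoldenCase) -> None:
    """Append one case as a JSON line; append-only files keep Git diffs small."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(case), sort_keys=True) + "\n")


def dataset_fingerprint(path: Path) -> str:
    """Short content hash identifying the exact snapshot an eval run used."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]
```

Logging the fingerprint next to each eval result makes it possible to tell a score change caused by the prompt from one caused by the dataset itself changing.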

Judge calibration

Before deploying a judge at CI scale: validate it on your domain. Target Pearson r > 0.7 between judge scores and domain-expert verdicts — below that you’re measuring something, but it may not correlate with actual user experience. [4]

Start with a capable model (GPT-4o class). Annotate ~50 cases with real-world outcomes, not just expected scores. Once the judge consistently agrees with expert annotations → scale to full dataset runs. [16]
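The calibration check itself is a few lines. This sketch computes Pearson r by hand so it has no dependencies and compares it to the 0.7 target; the sample scores are invented for illustration, and `judge_is_calibrated` is a hypothetical helper name.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


def judge_is_calibrated(judge_scores: list[float], expert_scores: list[float],
                        min_r: float = 0.7) -> bool:
    """True if the judge tracks expert verdicts closely enough to gate on."""
    return pearson_r(judge_scores, expert_scores) > min_r


# Invented example: judge scores vs. binary expert verdicts on five cases.
judge = [0.9, 0.8, 0.3, 0.2, 0.7]
expert = [1, 1, 0, 0, 1]
print(judge_is_calibrated(judge, expert))
```

With only ~50 annotated cases the estimate of r is itself noisy, so treat a value just above 0.7 as borderline rather than as a pass.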

For structured agent output: G-Eval and DeepEval’s DAG metric work well. For RAG systems: Faithfulness + Answer Relevancy are the two non-negotiables. Faithfulness catches hallucinations; Answer Relevancy catches answers that drift off-topic from the question. [3]

Closing the loop: canary + production monitoring

Post-merge, route 1–5% of production traffic to the new version. Score both the canary and incumbent populations with the same rubric library. If rolling-mean drift exceeds the calibrated threshold → automatic rollback. [1]
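The rollback trigger can be sketched as two rolling windows of scores compared on every update. The window size, drift threshold, and minimum sample count below are placeholder values, and `DriftMonitor` is a hypothetical class; the actual rollback call depends on your deployment system and is left out.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Compare rolling-mean scores of the canary and incumbent populations."""

    def __init__(self, window: int = 200, max_drift: float = 0.05,
                 min_samples: int = 50):
        # Placeholder defaults; calibrate max_drift against your judge noise.
        self.scores = {"canary": deque(maxlen=window),
                       "incumbent": deque(maxlen=window)}
        self.max_drift = max_drift
        self.min_samples = min_samples

    def record(self, population: str, score: float) -> None:
        """Add one rubric score for 'canary' or 'incumbent'."""
        self.scores[population].append(score)

    def should_rollback(self) -> bool:
        """True when the canary's rolling mean trails the incumbent by more than max_drift."""
        canary, incumbent = self.scores["canary"], self.scores["incumbent"]
        if len(canary) < self.min_samples or len(incumbent) < self.min_samples:
            return False  # not enough data yet to judge either population
        return mean(incumbent) - mean(canary) > self.max_drift
```

Comparing the canary against the incumbent over the same time window, rather than against a stored historical number, keeps traffic-mix shifts from looking like a regression in the new version.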

Every production failure that reaches users becomes a permanent test case in the golden dataset within 48 hours. This is how the gate becomes progressively harder to fool — CI evals are not a checkpoint, they’re a flywheel. [5] [7]