Decision. Build the workshop as a runnable lab arc on DeepEval ⭐ 16k [4] — Python-native, pytest-based, CI-ready — not on Promptfoo, which OpenAI acquired in March 2026 and which is now a strategic risk for a vendor-neutral teaching lab [6][7]. Teach evals the way Hamel Husain & Shreya Shankar do: error analysis first, judge second, gate third [1]. The single best “aha” is the planted-regression reveal: a green CI pipeline turns red on a one-line prompt edit the eye would miss. Run it hands-on if seats ≤ 30 and you can pre-ship a devcontainer; run the talk variant otherwise — laptop/setup chaos is the dominant failure mode of live labs [13].
Why DeepEval for the lab (and not Promptfoo)
The 2026 framework field narrowed to two leaders, then forked on ownership.
| Axis | DeepEval ⭐ 16k | Promptfoo ⭐ 22k | Phoenix ⭐ 10k |
|---|---|---|---|
| Model | Python + pytest test cases [4] | Declarative YAML assertions [8] | OTel tracing + evals [16] |
| LLM-as-judge | `GEval`, 50+ metrics [3] | Built-in + custom graders [8] | Pre-built evaluators [16] |
| CI gate | `deepeval test run` → non-zero exit [5] | CLI + CI/CD configs [8] | Via test harness [16] |
| Golden dataset | `Golden`/`EvaluationDataset` [12] | YAML test rows [8] | Dataset objects [16] |
| 2026 ownership risk | Independent (Confident AI) | ⚠ Owned by OpenAI since Mar 2026 [7] | Independent (Arize) |
| Teaching fit | ✓ Devs already know pytest | ✓ Lowest setup, but config-not-code | Heavier; tracing-first |
For an audience of expert software consultants, DeepEval’s pytest model is the right pedagogical lever: they already know assert, fixtures, and red/green CI — the lab teaches eval thinking, not a new tool’s DSL [4]. Promptfoo is genuinely lower-setup (YAML, no Python) [8], but two things kill it for this session: (1) OpenAI now owns it [6][7] — awkward for a vendor-neutral consultancy teaching client-facing rigor, with open community doubt about long-term provider neutrality [7]; and (2) YAML assertions hide the judge logic the workshop wants attendees to write. Mention Promptfoo as the “I want this in an afternoon, no Python” alternative, and Braintrust/Phoenix as the platform/tracing options [14][16].
The teaching spine: error-analysis-first
The canonical 2026 method is Hamel Husain & Shreya Shankar’s loop [1][2], and the workshop should inherit its order — most teams fail by jumping straight to metrics:
1. Open coding: skim 50-100 real traces (~30s each), jot what actually broke; no root-causing yet [1].
2. Axial coding: cluster notes into specific failure categories (e.g. “conversational flow”, “tool-call failure”), not vague ones [1].
3. Quantify: pivot-table the categories to see which failure dominates [1].
4. Build a binary LLM-as-judge for the top failure mode: true/false, not Likert, because shipping decisions are binary [1][15].
5. Align the judge to human labels: measure TPR and TNR separately; raw agreement is a trap when failures are rare (with a 10% failure rate, an “always-pass” judge still scores 90% agreement) [1].
6. Gate it: the golden set + judge become a regression suite in CI [5].
Compress steps 1-3 to a 10-minute taste in a 2h slot (full error analysis is its own hour); the runnable arc is steps 4-6. This mirrors how Evidently [9] and W&B [17] structure their applied tracks: custom judges → datasets → CI/monitoring.
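The Quantify step needs nothing more than a frequency count. The following is a minimal sketch in plain Python, assuming the axial-coding pass produced one category label per failed trace; the trace IDs, the category names beyond those quoted above, and the helper `rank_failure_modes` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical axial-coding output: one (trace_id, category) pair per failed trace.
coded_traces = [
    ("t01", "tool-call failure"),
    ("t02", "conversational flow"),
    ("t03", "tool-call failure"),
    ("t04", "hallucinated ingredient"),
    ("t05", "tool-call failure"),
    ("t06", "conversational flow"),
]


def rank_failure_modes(pairs):
    """Return (category, count, share) tuples, most frequent category first."""
    counts = Counter(category for _, category in pairs)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


ranking = rank_failure_modes(coded_traces)
for cat, n, share in ranking:
    print(f"{cat:<26} {n:>3}  {share:.0%}")
```

The category at the top of the ranking is the one that earns the first judge; the rest wait until it is under control.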
The lab arc (what attendees build)
Ship a toy app with seeded failures — reuse the community-standard “Recipe Bot” shape [18] or any small RAG/agent. Four checkpoints, each a green tick before moving on:
| # | Build | Concept landed | Cite |
|---|---|---|---|
| 1 | Golden dataset (20-40 rows) | `Golden` → `LLMTestCase`; ground truth from labeled traces | [12] |
| 2 | LLM-as-judge grader | `GEval` rubric, binary pass/fail, threshold tuning | [4] |
| 3 | Judge alignment check | TPR/TNR vs your hand-labels; iterate the rubric | [1] |
| 4 | CI regression gate | `deepeval test run` in GitHub Actions, non-zero = red | [5] |
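Checkpoint 3 can be sketched without any framework. The snippet below is an illustrative stdlib-only version, assuming verdicts and hand-labels are booleans where `True` means "pass" (so "pass" is the positive class); the function name `alignment_report` and the 18/2 label split are assumptions chosen to reproduce the rare-failure trap, not data from a real run.

```python
def alignment_report(human_labels, judge_verdicts):
    """Compare judge verdicts with human labels; True means 'pass' in both lists."""
    if not human_labels or len(human_labels) != len(judge_verdicts):
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(human_labels, judge_verdicts))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # judge passes a real failure
    fn = sum(1 for h, j in pairs if h and not j)      # judge fails a real pass
    positives, negatives = tp + fn, tn + fp
    return {
        "agreement": (tp + tn) / len(pairs),
        "tpr": tp / positives if positives else None,
        "tnr": tn / negatives if negatives else None,
    }


# Hypothetical hand-labels: 18 passes and 2 failures (a 10% failure rate).
human = [True] * 18 + [False] * 2
always_pass = [True] * 20  # a judge that never fails anything

report = alignment_report(human, always_pass)
print(report)
```

Raw agreement looks healthy here while TNR is zero: the judge has never caught a single failure, which is the only thing a regression gate needs it to do.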
The planted regression (the payoff). Before the session, prepare a second branch where one prompt line is subtly degraded (e.g. drop “only use the provided context”, so the app starts hallucinating beyond its sources). Attendees push it; green CI goes red automatically because the alignment-tuned judge catches what eyeballing the diff would not [5]. That red X is the whole session in one moment. For an advanced bonus, show 2026’s CI patterns: assert on `pass_rate`/`avg_score`/p50-p95 percentiles and route-tag goldens so a PR diff only re-runs affected routes [11].
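The aggregate-gate idea from the advanced bonus can be sketched in plain Python. This is not DeepEval's API: the helpers `summarize`, `affected_goldens`, and `gate`, the `route` key, the thresholds, and the score lists are all hypothetical, and the sketch assumes the judge emits one score in [0, 1] per golden.

```python
import statistics


def summarize(scores, threshold):
    """Aggregate per-golden judge scores (floats in [0, 1]) into gate metrics."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute percentiles")
    cuts = statistics.quantiles(scores, n=100, method="inclusive")  # 99 cut points
    return {
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
        "avg_score": statistics.fmean(scores),
        "p50": cuts[49],
        "p95": cuts[94],
    }


def affected_goldens(goldens, changed_routes):
    """Keep only goldens whose route tag is touched by the PR diff."""
    return [g for g in goldens if g["route"] in changed_routes]


def gate(scores, threshold=0.5, min_pass_rate=0.9, min_avg=0.7):
    """Return (exit_code, stats); a CI wrapper would hand exit_code to sys.exit."""
    stats = summarize(scores, threshold)
    ok = stats["pass_rate"] >= min_pass_rate and stats["avg_score"] >= min_avg
    return (0 if ok else 1), stats


# Hypothetical judge scores for the same ten goldens on two branches.
main_scores = [0.9, 0.8, 1.0, 0.95, 0.85, 0.9, 0.7, 0.9, 1.0, 0.8]
regression_scores = [0.9, 0.3, 0.2, 0.95, 0.4, 0.9, 0.1, 0.9, 0.3, 0.8]

main_code, main_stats = gate(main_scores)
regr_code, regr_stats = gate(regression_scores)
print("main:", main_code, main_stats)
print("regression:", regr_code, regr_stats)

# Route tags let a PR that only touches "search" skip unrelated goldens.
goldens = [
    {"id": "g1", "route": "search"},
    {"id": "g2", "route": "substitute"},
    {"id": "g3", "route": "search"},
]
to_run = affected_goldens(goldens, {"search"})
```

Gating on an aggregate rather than on every single case tolerates one noisy judge verdict without letting a broad drop through; the thresholds here are placeholders to tune against the baseline branch.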
Fixtures to ship attendees
- `goldens.csv` / `goldens.json`: 20-40 labeled rows (DeepEval loads either) [19].
- `traces/`: 60-80 raw output traces for the 10-min error-analysis taste [1].
- `app/`: the toy bot, plus a `regression` branch with the planted defect.
- `.github/workflows/evals.yml`: the gate, pre-written; attendees only flip the trigger [5].
- Pre-recorded judge/CI run output as a fallback if API keys or network die.
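A broken fixture file is cheaper to catch before the session than during it. The following is a small lint sketch for the golden CSV, assuming a hypothetical schema of `input`, `expected_output`, and `label` columns with `pass`/`fail` labels; the real column names depend on how the fixtures are authored, and the sample rows are invented.

```python
import csv
import io

REQUIRED_COLUMNS = {"input", "expected_output", "label"}  # hypothetical schema
ALLOWED_LABELS = {"pass", "fail"}


def lint_goldens(csv_text, min_rows=20, max_rows=40):
    """Return a list of human-readable problems; an empty list means the file is OK."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    rows = list(reader)
    problems = []
    if not min_rows <= len(rows) <= max_rows:
        problems.append(f"expected {min_rows}-{max_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows, start=1):
        if not (row["input"] or "").strip():
            problems.append(f"row {i}: empty input")
        label = (row["label"] or "").strip().lower()
        if label not in ALLOWED_LABELS:
            problems.append(f"row {i}: unexpected label {label!r}")
    return problems


# Invented two-row sample; min_rows is lowered so only the label problem shows.
sample = (
    "input,expected_output,label\n"
    "How long do I boil an egg?,About seven minutes,pass\n"
    "Suggest a vegan brownie,,maybe\n"
)
issues = lint_goldens(sample, min_rows=1)
print(issues)
```

In practice the same function would read the shipped file with `open("goldens.csv").read()` and run as the first test in the repo.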
Prerequisites & setup (the part that makes or breaks the live version)
Setup failure is the #1 hands-on-workshop killer [13]. Mitigate hard:
- Ship a devcontainer / Codespace with `deepeval`, `pytest`, and deps pinned: one click, no local Python roulette [4].
- Pre-distribute API keys (or a shared proxy with a budget cap); never have 30 people make accounts live. Set `DEEPEVAL_RESULTS_FOLDER` for local JSON so nobody needs a cloud login [4].
- Offline judge option: point `GEval` at a small local model so a dead network doesn’t kill the room.
- A pre-seeded GitHub repo per attendee (or a fork button) so the CI step actually runs Actions [5].
- Prereq for attendees: comfortable with Python + pytest + git PRs. Skip 101; they’re experts.
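A preflight script that attendees run in the first minute turns setup triage into a checklist. The sketch below is one possible shape using only the standard library; the environment variable name `JUDGE_API_KEY` and the minimum Python version are assumptions, not requirements stated by DeepEval, while `DEEPEVAL_RESULTS_FOLDER` is the variable mentioned above.

```python
import importlib.util
import os
import sys


def preflight(
    required_modules=("deepeval", "pytest"),
    env_vars=("JUDGE_API_KEY", "DEEPEVAL_RESULTS_FOLDER"),  # first name is hypothetical
    min_python=(3, 9),  # assumed floor; pin whatever the devcontainer ships
    environ=None,
):
    """Return {check_name: bool} so failures can be read out at a glance."""
    environ = os.environ if environ is None else environ
    checks = {"python": sys.version_info >= min_python}
    for name in required_modules:
        # find_spec returns None for a top-level module that is not installed.
        checks[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for var in env_vars:
        checks[f"env:{var}"] = bool(environ.get(var))
    return checks


results = preflight()
for check, ok in results.items():
    print(f"{'OK  ' if ok else 'FAIL'} {check}")
```

Anyone with a `FAIL` line raises a hand during the 0:20 buffer instead of discovering the problem at Checkpoint 2.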
Minute-by-minute (2h hands-on)
| Time | Segment | Mode |
|---|---|---|
| 0:00-0:10 | Why evals; the planted-regression promise (cold open the payoff) | talk |
| 0:10-0:20 | Error analysis taste: open→axial on `traces/` | hands-on |
| 0:20-0:25 | Buffer / setup triage | — |
| 0:25-0:45 | CP1 build golden dataset | hands-on |
| 0:45-1:10 | CP2 write the `GEval` judge | hands-on |
| 1:10-1:30 | CP3 align judge: TPR/TNR, iterate rubric | hands-on |
| 1:30-1:35 | Break / buffer | — |
| 1:35-1:55 | CP4 wire the CI gate; push the planted regression → red | hands-on |
| 1:55-2:00 | Debrief: what to do Monday, prod monitoring next step | talk |
Buffers at 0:20 and 1:30 are non-negotiable — labs always run long, and CP2 (judge-writing) is where people stall. If time slips, cut CP3’s depth, never CP4 — the red-CI reveal is the session.
Live demos & “aha” moments (ranked)
- Green→red CI on a one-line prompt edit — the headline; eyeballing missed it, the judge didn’t [5].
- The “always-pass” judge scoring 90% accuracy — show the trap metric live, then split into TPR/TNR and watch the judge look terrible [1].
- Binary vs Likert — let two attendees grade the same output on 1-5 and disagree, then re-grade binary and converge [1].
- Rubric edit → score flip — tighten one judge sentence, re-run, watch a borderline case flip; evals are code you debug [15].
- Spot-check the judge — sample 5-10% of verdicts against human calls to keep it honest [10].
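The spot-check in the last item is easy to make reproducible. Here is a minimal sketch, assuming judge verdicts are identified by string IDs; the function name `spot_check_sample`, the ID format, and the fixed seed are illustrative choices rather than part of any library.

```python
import random


def spot_check_sample(verdict_ids, fraction=0.1, seed=0, minimum=1):
    """Pick a reproducible random subset of judge verdicts for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(verdict_ids)
    k = min(len(ids), max(minimum, round(len(ids) * fraction)))
    rng = random.Random(seed)  # fixed seed: reviewers see the same sample on re-run
    return sorted(rng.sample(ids, k))


# Hypothetical verdict IDs from one CI run.
all_verdicts = [f"v{i:03d}" for i in range(200)]
to_review = spot_check_sample(all_verdicts, fraction=0.1)
print(len(to_review), to_review[:5])
```

Changing the seed per run (for example to the CI run number) rotates which verdicts get a human look while keeping each individual sample reproducible.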
Workshop vs talk: the tradeoff
A workshop’s value is hands-on muscle memory; a talk’s is reach and zero setup risk [13].
| Factor | Hands-on workshop | Talk-only variant |
|---|---|---|
| What sticks | Muscle memory: they’ve built a gate | Mental model + a repo to try later |
| Audience cap | ≤ ~30 (support per person) [13] | Unbounded |
| Failure mode | ⚠ Setup/laptop/key chaos eats time [13] | Passive; no “I did it” moment |
| Prep cost | High: devcontainer, keys, per-attendee repo | Low: slides + screen-recorded lab |
| Best when | Internal team, paid client, small cohort | Conference keynote, large meetup |
Talk variant (45-60 min). Same spine, you drive: cold-open the red CI, then rewind and narrate building golden set → judge → alignment → gate using pre-recorded terminal clips (live judge calls are slow and flaky on stage). End by handing out the repo so attendees run CP1-4 themselves. This is also your fallback if the hands-on room melts down — switch to driving from your machine and keep the payoff intact.
Hybrid (recommended default for Itenium): demo-driven talk with two short “you try it” beats (write the judge rubric; push the regression). Captures most of the muscle-memory value while bounding setup risk — and degrades gracefully to a pure talk if the room’s environments fail.
scout: standard depth, single pass. DeepEval star count via GitHub API on 2026-06-04.