Teaching Evals: A 2h Hands-On Workshop (and Talk Variant) for Expert Devs

Decision. Build the workshop as a runnable lab arc on DeepEval ⭐ 16k [4] — Python-native, pytest-based, CI-ready — not on Promptfoo, which OpenAI acquired in March 2026 and which is now a strategic risk for a vendor-neutral teaching lab [6][7]. Teach evals the way Hamel Husain & Shreya Shankar do: error analysis first, judge second, gate third [1]. The single best “aha” is the planted-regression reveal: a green CI pipeline turns red on a one-line prompt edit the eye would miss. Run it hands-on if seats ≤ 30 and you can pre-ship a devcontainer; run the talk variant otherwise — laptop/setup chaos is the dominant failure mode of live labs [13].

Why DeepEval for the lab (and not Promptfoo)

The 2026 framework field narrowed to two leaders, then forked on ownership.

Axis	DeepEval ⭐ 16k	Promptfoo ⭐ 22k	Phoenix ⭐ 10k
Authoring model	Python + pytest test cases [4]	Declarative YAML assertions [8]	OTel tracing + evals [16]
LLM-as-judge	`GEval`, 50+ metrics [3]	Built-in + custom graders [8]	Pre-built evaluators [16]
CI gate	`deepeval test run` → non-zero exit [5]	CLI + CI/CD configs [8]	Via test harness [16]
Golden dataset	`Golden`/`EvaluationDataset` [12]	YAML test rows [8]	Dataset objects [16]
2026 ownership risk	Independent (Confident AI)	⚠ Owned by OpenAI since Mar 2026 [7]	Independent (Arize)
Teaching fit	✓ Devs already know pytest	✓ Lowest setup, but config-not-code	Heavier; tracing-first

For an audience of expert software consultants, DeepEval’s pytest model is the right pedagogical lever: they already know assert, fixtures, and red/green CI — the lab teaches eval thinking, not a new tool’s DSL [4]. Promptfoo is genuinely lower-setup (YAML, no Python) [8], but two things kill it for this session: (1) OpenAI now owns it [6][7] — awkward for a vendor-neutral consultancy teaching client-facing rigor, with open community doubt about long-term provider neutrality [7]; and (2) YAML assertions hide the judge logic the workshop wants attendees to write. Mention Promptfoo as the “I want this in an afternoon, no Python” alternative, and Braintrust/Phoenix as the platform/tracing options [14][16].

The teaching spine: error-analysis-first

The canonical 2026 method is Hamel Husain & Shreya Shankar’s loop [1][2], and the workshop should inherit its order — most teams fail by jumping straight to metrics:

1. Open coding: skim 50-100 real traces (~30s each) and jot what actually broke; no root-causing yet [1].
2. Axial coding: cluster the notes into specific failure categories (e.g. “conversational flow”, “tool-call failure”), not vague ones [1].
3. Quantify: pivot-table the categories to see which failure dominates [1].
4. Build a binary LLM-as-judge for the top failure mode: true/false, not Likert, because shipping decisions are binary [1][15].
5. Align the judge to human labels: measure TPR and TNR separately; raw agreement is a trap when failures are rare (an “always-pass” judge scores 90% agreement when only 10% of traces fail) [1].
6. Gate it: the golden set + judge become a regression suite in CI [5].
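The alignment step is easy to show with plain Python before any framework enters the room. The sketch below treats "pass" as the positive class (an assumption; flip it if your labels are framed around failures) and uses made-up labels: 100 hand-labeled traces, 10 of them real failures, scored by a judge that passes everything.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    accuracy: float  # raw agreement with the human labels
    tpr: float       # share of human-passed traces the judge also passes
    tnr: float       # share of human-failed traces the judge also fails


def align(human: list[bool], judge: list[bool]) -> Alignment:
    """Compare judge verdicts to human labels (True = pass, False = fail)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    true_pos = sum(h and j for h, j in pairs)
    true_neg = sum((not h) and (not j) for h, j in pairs)
    positives = sum(human)
    negatives = len(human) - positives
    return Alignment(
        accuracy=(true_pos + true_neg) / len(human),
        tpr=true_pos / positives if positives else float("nan"),
        tnr=true_neg / negatives if negatives else float("nan"),
    )


# Illustrative data: 90 good traces, 10 bad ones, and a judge that never fails anything.
human_labels = [True] * 90 + [False] * 10
always_pass = [True] * 100

result = align(human_labels, always_pass)
print(result)  # accuracy 0.9, tpr 1.0, tnr 0.0
```

The 0.9 agreement looks healthy; the 0.0 TNR shows the judge has never caught a single failure, which is the only job the gate needs it to do.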

Compress steps 1-3 to a 10-minute taste in a 2h slot (full error analysis is its own hour); the runnable arc is steps 4-6. This mirrors how Evidently [9] and W&B [17] structure their applied tracks: custom judges → datasets → CI/monitoring.
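For the 10-minute taste, the quantify step needs little more than a counter. This is a minimal sketch with invented axial codes (one per reviewed trace, None where nothing broke); a spreadsheet pivot table does the same job.

```python
from collections import Counter

# Hypothetical axial codes from a quick pass over ten traces.
coded_traces = [
    "tool-call failure", None, "conversational flow", "tool-call failure",
    None, "ignored context", "tool-call failure", "conversational flow",
    None, "tool-call failure",
]

# Keep only traces where a failure was noted, then count per category.
failures = [code for code in coded_traces if code is not None]
tally = Counter(failures)

for category, count in tally.most_common():
    share = count / len(coded_traces)
    print(f"{category:<22}{count:>3}  {share:.0%} of traces")

# The most frequent category is the one that earns a judge first.
top_failure, _ = tally.most_common(1)[0]
```

Here the tally points at "tool-call failure" (4 of 10 traces), so that is the failure mode the binary judge in step 4 would target.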

The lab arc (what attendees build)

Ship a toy app with seeded failures — reuse the community-standard “Recipe Bot” shape [18] or any small RAG/agent. Four checkpoints, each a green tick before moving on:

#	Build	Concept landed	Cite
1	Golden dataset (20-40 rows)	`Golden`→`LLMTestCase`; ground truth from labeled traces	[12]
2	LLM-as-judge grader	`GEval` rubric, binary pass/fail, threshold tuning	[4]
3	Judge alignment check	TPR/TNR vs your hand-labels; iterate the rubric	[1]
4	CI regression gate	`deepeval test run` in GitHub Actions, non-zero = red	[5]
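Checkpoints 1 and 2 might look roughly like the sketch below. It assumes `deepeval` is installed and a judge model is already configured (for example through an API key); the rubric wording, the metric name, and the helper names `build_judge` and `check_row` are hypothetical. The imports are deferred into the functions only so the file still loads on a machine without DeepEval.

```python
# Sketch of checkpoints 1-2. Assumes deepeval is installed and a judge model
# is configured; neither is checked here.

# Hypothetical rubric for the toy bot; tune the wording against your own labels.
FAITHFULNESS_RUBRIC = (
    "PASS only if every factual claim in the actual output is supported by "
    "the retrieval context. FAIL if any claim is unsupported or contradicts it."
)


def build_judge():
    """Return a binary GEval metric for the rubric above."""
    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCaseParams

    return GEval(
        name="Faithfulness to context",
        criteria=FAITHFULNESS_RUBRIC,
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT,
        ],
        strict_mode=True,  # binary verdict instead of a graded score
    )


def check_row(question: str, answer: str, context: list[str]) -> None:
    """Raise AssertionError if the judge fails this golden row."""
    from deepeval import assert_test
    from deepeval.test_case import LLMTestCase

    case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=context,
    )
    assert_test(case, [build_judge()])
```

In the lab, `check_row` would be called from a parametrized pytest test, one call per golden row, and the file run with `deepeval test run` so a failing row produces the non-zero exit that checkpoint 4 relies on.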

The planted regression (the payoff). Before the session, prepare a second branch where one prompt line is subtly degraded (e.g. drop “only use the provided context”, so the bot starts answering from outside the retrieved context). Attendees push it; green CI goes red automatically because the alignment-tuned judge catches the hallucinations that eyeballing the diff would not [5]. That red X is the whole session in one moment. For an advanced bonus, show 2026’s CI patterns: assert on pass_rate/avg_score/p50-p95 percentiles and route-tag goldens so a PR diff only re-runs affected routes [11].
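The aggregate-assertion pattern can be sketched without any framework. The scores, the limits, and the nearest-rank percentile helper below are all illustrative assumptions, not DeepEval features; the point is only that the gate checks distribution-level numbers rather than individual rows.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def gate(scores: list[float], threshold: float = 0.5, min_pass_rate: float = 0.9,
         min_avg: float = 0.8, min_p50: float = 0.7) -> list[str]:
    """Return the violated limits; an empty list means the gate is green."""
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    avg_score = sum(scores) / len(scores)
    p50 = percentile(scores, 50)
    problems = []
    if pass_rate < min_pass_rate:
        problems.append(f"pass_rate {pass_rate:.2f} < {min_pass_rate}")
    if avg_score < min_avg:
        problems.append(f"avg_score {avg_score:.2f} < {min_avg}")
    if p50 < min_p50:
        problems.append(f"p50 {p50:.2f} < {min_p50}")
    return problems


# Hypothetical judge scores for the same ten goldens on two branches.
main_scores = [1.0, 0.9, 0.8, 1.0, 0.7, 0.9, 1.0, 0.8, 0.9, 1.0]
regression_scores = [1.0, 0.4, 0.3, 1.0, 0.2, 0.9, 0.4, 0.8, 0.3, 1.0]

main_problems = gate(main_scores)              # no violations
regression_problems = gate(regression_scores)  # all three limits violated

# A CI step would exit non-zero when any limit is violated.
exit_code = 1 if regression_problems else 0
for problem in regression_problems:
    print("GATE FAILED:", problem)
```

On the made-up regression branch, half the rows drop below the threshold, so pass rate, average, and median all breach their limits and the step would exit 1.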

Fixtures to ship attendees

goldens.csv / goldens.json — 20-40 labeled rows (DeepEval loads either) [19].
traces/ — 60-80 raw output traces for the 10-min error-analysis taste [1].
app/ — the toy bot, plus a regression branch with the planted defect.
.github/workflows/evals.yml — the gate, pre-written; attendees only flip the trigger [5].
Pre-recorded judge/CI run output as a fallback if API keys or network die.
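A small validator run before the session can catch malformed golden rows early. The column names below (`input`, `actual_output`, `human_label`) are a hypothetical schema for the shipped CSV, not something DeepEval mandates, and the sample rows are invented.

```python
import csv
import io

# Hypothetical schema for goldens.csv; adjust to whatever the fixtures really use.
REQUIRED_COLUMNS = {"input", "actual_output", "human_label"}
ALLOWED_LABELS = {"pass", "fail"}


def load_goldens(handle) -> list[dict]:
    """Read golden rows from an open CSV file and reject malformed ones."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"goldens file is missing columns: {sorted(missing)}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["input"].strip():
            raise ValueError(f"line {line_no}: empty input")
        if row["human_label"] not in ALLOWED_LABELS:
            raise ValueError(f"line {line_no}: human_label must be pass or fail")
        rows.append(row)
    return rows


# In-memory stand-in for goldens.csv so the sketch runs on its own.
sample = io.StringIO(
    "input,actual_output,human_label\n"
    "How long do I boil an egg?,About 7 minutes for a jammy yolk.,pass\n"
    "Is raw flour safe to eat?,Yes it is fine in small amounts.,fail\n"
)
goldens = load_goldens(sample)
print(len(goldens), "golden rows loaded")
```

With a real file, the call would be `load_goldens(open("goldens.csv", newline=""))`; the validated rows can then be turned into test cases at checkpoint 1.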

Prerequisites & setup (the part that makes or breaks the live version)

Setup failure is the #1 hands-on-workshop killer [13]. Mitigate hard:

Ship a devcontainer / Codespace with deepeval, pytest, and deps pinned — one click, no local Python roulette [4].
Pre-distribute API keys (or a shared proxy with a budget cap); never have 30 people make accounts live. Set DEEPEVAL_RESULTS_FOLDER for local JSON so nobody needs a cloud login [4].
Offline judge option — point GEval at a small local model so a dead network doesn’t kill the room.
A pre-seeded GitHub repo per attendee (or a fork button) so the CI step actually runs Actions [5].
Prereq for attendees: comfortable with Python + pytest + git PRs. Skip 101 — they’re experts.
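A preflight script that attendees run in the first minute makes setup triage faster. This is a sketch under assumptions: `OPENAI_API_KEY` stands in for whichever judge provider the lab uses, and the function name `preflight` is made up; only `DEEPEVAL_RESULTS_FOLDER` comes from the setup notes above.

```python
import importlib.util
import os
from collections.abc import Mapping


def preflight(
    env: Mapping[str, str],
    modules: tuple[str, ...] = ("deepeval", "pytest"),
    required_env: tuple[str, ...] = ("OPENAI_API_KEY", "DEEPEVAL_RESULTS_FOLDER"),
) -> list[str]:
    """Return a list of setup problems; an empty list means ready to start."""
    problems = []
    for name in modules:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    for var in required_env:
        if not env.get(var):
            problems.append(f"unset environment variable: {var}")
    return problems


if __name__ == "__main__":
    issues = preflight(os.environ)
    for issue in issues:
        print("SETUP:", issue)
    print("ready" if not issues else f"{len(issues)} problem(s) to fix")
```

Baking the same check into the devcontainer's startup means most problems surface before the session rather than during the 0:20 buffer.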

Minute-by-minute (2h hands-on)

Time	Segment	Mode
0:00-0:10	Why evals; the planted-regression promise (cold open the payoff)	talk
0:10-0:20	Error analysis taste: open→axial on `traces/`	hands-on
0:20-0:25	Buffer / setup triage	—
0:25-0:45	CP1 build golden dataset	hands-on
0:45-1:10	CP2 write the `GEval` judge	hands-on
1:10-1:30	CP3 align judge: TPR/TNR, iterate rubric	hands-on
1:30-1:35	Break / buffer	—
1:35-1:55	CP4 wire the CI gate; push the planted regression → red	hands-on
1:55-2:00	Debrief: what to do Monday, prod monitoring next step	talk

Buffers at 0:20 and 1:30 are non-negotiable — labs always run long, and CP2 (judge-writing) is where people stall. If time slips, cut CP3’s depth, never CP4 — the red-CI reveal is the session.

Live demos & “aha” moments (ranked)

1. Green→red CI on a one-line prompt edit: the headline; eyeballing missed it, the judge didn’t [5].
2. The “always-pass” judge scoring 90% accuracy: show the trap metric live, then split into TPR/TNR and watch the judge look terrible [1].
3. Binary vs Likert: let two attendees grade the same output on 1-5 and disagree, then re-grade binary and converge [1].
4. Rubric edit → score flip: tighten one judge sentence, re-run, watch a borderline case flip; evals are code you debug [15].
5. Spot-check the judge: sample 5-10% of verdicts against human calls to keep it honest [10].
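The spot-check can be made reproducible with a seeded sample, so two reviewers pull the same verdicts. A minimal sketch, assuming verdicts are identified by string IDs; the ID format and the function name are invented for illustration.

```python
import random


def spot_check_sample(verdict_ids: list[str], percent: int = 10, seed: int = 0) -> list[str]:
    """Pick a reproducible sample of judge verdicts for human review."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if not verdict_ids:
        return []
    # Integer ceiling division avoids float rounding surprises; always review at least one.
    k = max(1, (len(verdict_ids) * percent + 99) // 100)
    rng = random.Random(seed)  # fixed seed gives the same sample on every run
    return sorted(rng.sample(verdict_ids, k))


# Hypothetical verdict IDs from one CI run.
verdicts = [f"trace-{i:03d}" for i in range(50)]
to_review = spot_check_sample(verdicts, percent=10)
print(len(to_review), "verdicts queued for human review")
```

Changing the seed per run (for example to the CI run number) keeps the sample reproducible while still rotating which verdicts get a human look.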

Workshop vs talk: the tradeoff

A workshop’s value is hands-on muscle memory; a talk’s is reach and zero setup risk [13].

Factor	Hands-on workshop	Talk-only variant
What sticks	Muscle memory: they’ve built a gate	Mental model + a repo to try later
Audience cap	≤ ~30 (support per person) [13]	Unbounded
Failure mode	⚠ Setup/laptop/key chaos eats time [13]	Passive; no “I did it” moment
Prep cost	High: devcontainer, keys, per-attendee repo	Low: slides + screen-recorded lab
Best when	Internal team, paid client, small cohort	Conference keynote, large meetup

Talk variant (45-60 min). Same spine, you drive: cold-open the red CI, then rewind and narrate building golden set → judge → alignment → gate using pre-recorded terminal clips (live judge calls are slow and flaky on stage). End by handing out the repo so attendees run CP1-4 themselves. This is also your fallback if the hands-on room melts down — switch to driving from your machine and keep the payoff intact.

Hybrid (recommended default for Itenium): demo-driven talk with two short “you try it” beats (write the judge rubric; push the regression). Captures most of the muscle-memory value while bounding setup risk — and degrades gracefully to a pure talk if the room’s environments fail.

DeepEval star count via GitHub API on 2026-06-04.