Atlas survey

Teaching Evals: A 2h Hands-On Workshop (and Talk Variant) for Expert Devs

A runnable 2h evals lab on DeepEval — golden set → LLM-judge → CI gate → planted regression — with timing, fixtures, and the workshop-vs-talk call.

19 sources ~9 min read evals · llm · workshop · teaching · deepeval · ci · llm-as-judge

Decision. Build the workshop as a runnable lab arc on DeepEval ⭐ 16k [4] — Python-native, pytest-based, CI-ready — not on Promptfoo, which OpenAI acquired in March 2026 and which is now a strategic risk for a vendor-neutral teaching lab [6][7]. Teach evals the way Hamel Husain & Shreya Shankar do: error analysis first, judge second, gate third [1]. The single best “aha” is the planted-regression reveal: a green CI pipeline turns red on a one-line prompt edit the eye would miss. Run it hands-on if seats ≤ 30 and you can pre-ship a devcontainer; run the talk variant otherwise — laptop/setup chaos is the dominant failure mode of live labs [13].

Why DeepEval for the lab (and not Promptfoo)

The 2026 framework field narrowed to two leaders, DeepEval and Promptfoo, then forked on ownership. Phoenix is included as the tracing-first reference point.

| Axis | DeepEval ⭐ 16k | Promptfoo ⭐ 22k | Phoenix ⭐ 10k |
|---|---|---|---|
| Model | Python + pytest test cases [4] | Declarative YAML assertions [8] | OTel tracing + evals [16] |
| LLM-as-judge | GEval, 50+ metrics [3] | Built-in + custom graders [8] | Pre-built evaluators [16] |
| CI gate | deepeval test run → non-zero exit [5] | CLI + CI/CD configs [8] | Via test harness [16] |
| Golden dataset | Golden / EvaluationDataset [12] | YAML test rows [8] | Dataset objects [16] |
| 2026 ownership risk | Independent (Confident AI) | ⚠ Owned by OpenAI since Mar 2026 [7] | Independent (Arize) |
| Teaching fit | ✓ Devs already know pytest | ✓ Lowest setup, but config-not-code | Heavier; tracing-first |

For an audience of expert software consultants, DeepEval’s pytest model is the right pedagogical lever: they already know assert, fixtures, and red/green CI — the lab teaches eval thinking, not a new tool’s DSL [4]. Promptfoo is genuinely lower-setup (YAML, no Python) [8], but two things kill it for this session: (1) OpenAI now owns it [6][7] — awkward for a vendor-neutral consultancy teaching client-facing rigor, with open community doubt about long-term provider neutrality [7]; and (2) YAML assertions hide the judge logic the workshop wants attendees to write. Mention Promptfoo as the “I want this in an afternoon, no Python” alternative, and Braintrust/Phoenix as the platform/tracing options [14][16].

The teaching spine: error-analysis-first

The canonical 2026 method is Hamel Husain & Shreya Shankar’s loop [1][2], and the workshop should inherit its order — most teams fail by jumping straight to metrics:

  1. Open coding — skim 50-100 real traces (~30s each), jot what actually broke; no root-causing yet [1].
  2. Axial coding — cluster notes into specific failure categories (e.g. “conversational flow”, “tool-call failure”), not vague ones [1].
  3. Quantify — pivot-table the categories to see which failure dominates [1].
  4. Build a binary LLM-as-judge for the top failure mode — true/false, not Likert, because shipping decisions are binary [1][15].
  5. Align the judge to human labels: measure TPR and TNR separately, because raw agreement is a trap when failures are rare (an “always-pass” judge scores 90% agreement when only 10% of traces fail) [1].
  6. Gate it — the golden set + judge become a regression suite in CI [5].
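
The alignment check in step 5 can be sketched in a few lines of plain Python. This is an illustration rather than part of DeepEval: the function name judge_alignment and the label lists are hypothetical, and True stands for "pass". It shows why raw agreement flatters a judge that never fails anything.

```python
def judge_alignment(human, judge):
    """Compare judge verdicts with human labels (True = pass, False = fail).

    Returns raw agreement, TPR (share of human passes the judge also passes)
    and TNR (share of human fails the judge also fails).
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    positives = sum(human)
    negatives = len(human) - positives
    return {
        "agreement": (tp + tn) / len(pairs),
        "tpr": tp / positives if positives else float("nan"),
        "tnr": tn / negatives if negatives else float("nan"),
    }


# Hypothetical labels: 18 good outputs and 2 bad ones (a 10% failure rate).
human_labels = [True] * 18 + [False] * 2
always_pass = [True] * 20

print(judge_alignment(human_labels, always_pass))
# {'agreement': 0.9, 'tpr': 1.0, 'tnr': 0.0}
```

The 0.0 TNR is the number to put on screen: the judge agrees with humans 90% of the time and still catches none of the failures.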

Compress steps 1-3 to a 10-minute taste in a 2h slot (full error analysis is its own hour); the runnable arc is steps 4-6. This mirrors how Evidently [9] and W&B [17] structure their applied tracks: custom judges → datasets → CI/monitoring.
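
For the 10-minute taste, the quantify step does not need a spreadsheet. A tally like the one below is enough to show which failure dominates. The category names and the trace_codes list are made up for illustration; in the room they would come from the attendees' own axial codes.

```python
from collections import Counter

# Hypothetical axial codes, one per reviewed trace (None = no failure found).
trace_codes = [
    "ignored-dietary-constraint", None, "hallucinated-ingredient",
    "ignored-dietary-constraint", "tool-call-failure", None,
    "ignored-dietary-constraint", "hallucinated-ingredient", None, None,
]


def rank_failures(codes):
    """Return (category, count, share of all traces), most frequent first."""
    counts = Counter(code for code in codes if code is not None)
    return [(cat, n, n / len(codes)) for cat, n in counts.most_common()]


for category, count, share in rank_failures(trace_codes):
    print(f"{category:30} {count:3d}  {share:.0%}")
```

The top row is the failure mode that gets a judge in step 4; the rest wait.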

The lab arc (what attendees build)

Ship a toy app with seeded failures — reuse the community-standard “Recipe Bot” shape [18] or any small RAG/agent. Four checkpoints, each a green tick before moving on:

| # | Build | Concept landed | Cite |
|---|---|---|---|
| 1 | Golden dataset (20-40 rows) | Golden, LLMTestCase; ground truth from labeled traces | [12] |
| 2 | LLM-as-judge grader | GEval rubric, binary pass/fail, threshold tuning | [4] |
| 3 | Judge alignment check | TPR/TNR vs your hand-labels; iterate the rubric | [1] |
| 4 | CI regression gate | deepeval test run in GitHub Actions, non-zero = red | [5] |

The planted regression (the payoff). Before the session, prepare a second branch where one prompt line is subtly degraded (e.g. drop “only use the provided context”, so the bot starts hallucinating and the judge starts failing its answers). Attendees push it; green CI goes red automatically because the alignment-tuned judge catches what eyeballing the diff would not [5]. That red X is the whole session in one moment. For an advanced bonus, show 2026’s CI patterns: assert on pass_rate, avg_score, and p50-p95 percentiles, and route-tag goldens so a PR diff only re-runs affected routes [11].
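
The aggregate-assertion pattern can be illustrated without any framework. The sketch below is plain Python, not DeepEval's API: summarize, run_gate, the floor values and both score lists are hypothetical, and the scores are assumed to be per-golden judge scores between 0.0 and 1.0.

```python
import statistics


def summarize(scores, pass_threshold=0.5):
    """Aggregate per-golden judge scores (0.0 to 1.0) into gate statistics."""
    # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return {
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "avg_score": statistics.fmean(scores),
        "p50": cuts[49],
        "p95": cuts[94],
    }


def run_gate(scores, floors):
    """Return a process exit code: 0 if every floor holds, 1 otherwise."""
    stats = summarize(scores)
    failed = [name for name, floor in floors.items() if stats[name] < floor]
    for name in failed:
        print(f"FAIL {name}: {stats[name]:.2f} < {floors[name]:.2f}")
    return 1 if failed else 0


# Hypothetical judge scores; the second list stands in for the degraded prompt.
baseline = [1.0] * 19 + [0.0]
regressed = [1.0] * 14 + [0.0] * 6
floors = {"pass_rate": 0.90, "avg_score": 0.85}

print("baseline exit code:", run_gate(baseline, floors))
print("regressed exit code:", run_gate(regressed, floors))
# In CI, the entry point would end with sys.exit(run_gate(scores, floors)).
```

Gating on a pass-rate floor rather than on every single case keeps one flaky judge call from blocking a merge, while a real regression still trips the gate.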

Fixtures to ship attendees

  • goldens.csv / goldens.json — 20-40 labeled rows (DeepEval loads either) [19].
  • traces/ — 60-80 raw output traces for the 10-min error-analysis taste [1].
  • app/ — the toy bot, plus a regression branch with the planted defect.
  • .github/workflows/evals.yml — the gate, pre-written; attendees only flip the trigger [5].
  • Pre-recorded judge/CI run output as a fallback if API keys or network die.
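
A small validator run before the session can catch broken golden rows before attendees do. The sketch below assumes a CSV layout with input, expected_output and label columns; those column names and the pass/fail label vocabulary are illustrative choices, not something DeepEval requires.

```python
import csv
import io

REQUIRED = ("input", "expected_output", "label")  # hypothetical column names


def check_goldens(rows):
    """Return a list of human-readable problems; empty means the file is usable."""
    problems = []
    for i, row in enumerate(rows, start=2):  # row 1 is the CSV header
        for col in REQUIRED:
            if not (row.get(col) or "").strip():
                problems.append(f"row {i}: missing {col}")
        if (row.get("label") or "").strip() not in ("", "pass", "fail"):
            problems.append(f"row {i}: label must be pass or fail")
    return problems


# In-memory stand-in for goldens.csv so the example is self-contained.
sample = io.StringIO(
    "input,expected_output,label\n"
    "Vegan brownie recipe?,Uses flax egg and no dairy,pass\n"
    "Gluten-free pizza dough?,,maybe\n"
)
print(check_goldens(csv.DictReader(sample)))
# ['row 3: missing expected_output', 'row 3: label must be pass or fail']
```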

Prerequisites & setup (the part that makes or breaks the live version)

Setup failure is the #1 hands-on-workshop killer [13]. Mitigate hard:

  • Ship a devcontainer / Codespace with deepeval, pytest, and deps pinned — one click, no local Python roulette [4].
  • Pre-distribute API keys (or a shared proxy with a budget cap); never have 30 people make accounts live. Set DEEPEVAL_RESULTS_FOLDER for local JSON so nobody needs a cloud login [4].
  • Offline judge option — point GEval at a small local model so a dead network doesn’t kill the room.
  • A pre-seeded GitHub repo per attendee (or a fork button) so the CI step actually runs Actions [5].
  • Prereq for attendees: comfortable with Python + pytest + git PRs. Skip 101 — they’re experts.
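
A preflight script that attendees run during the first ten minutes makes setup triage concrete. The version below is a sketch: the OPENAI_API_KEY variable name and the Python 3.9 floor are assumptions that depend on which judge provider and DeepEval version the lab pins.

```python
import importlib.util
import os
import sys


def preflight(env, modules=("deepeval", "pytest"),
              variables=("OPENAI_API_KEY",),  # assumed provider key name
              min_python=(3, 9),              # assumed minimum version
              version=sys.version_info):
    """Return a list of setup problems; an empty list means ready to start."""
    problems = []
    if tuple(version[:2]) < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not installed: {name}")
    for var in variables:
        if not env.get(var):
            problems.append(f"environment variable not set: {var}")
    return problems


issues = preflight(os.environ)
print("ready" if not issues else "\n".join(issues))
```

Anyone who does not see "ready" raises a hand during the 0:20 buffer rather than at checkpoint 2.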

Minute-by-minute (2h hands-on)

| Time | Segment | Mode |
|---|---|---|
| 0:00-0:10 | Why evals; the planted-regression promise (cold open the payoff) | talk |
| 0:10-0:20 | Error analysis taste: open→axial on traces/ | hands-on |
| 0:20-0:25 | Buffer / setup triage | - |
| 0:25-0:45 | CP1 build golden dataset | hands-on |
| 0:45-1:10 | CP2 write the GEval judge | hands-on |
| 1:10-1:30 | CP3 align judge: TPR/TNR, iterate rubric | hands-on |
| 1:30-1:35 | Break / buffer | - |
| 1:35-1:55 | CP4 wire the CI gate; push the planted regression → red | hands-on |
| 1:55-2:00 | Debrief: what to do Monday, prod monitoring next step | talk |

Buffers at 0:20 and 1:30 are non-negotiable — labs always run long, and CP2 (judge-writing) is where people stall. If time slips, cut CP3’s depth, never CP4 — the red-CI reveal is the session.

Live demos & “aha” moments (ranked)

  1. Green→red CI on a one-line prompt edit — the headline; eyeballing missed it, the judge didn’t [5].
  2. The “always-pass” judge scoring 90% accuracy — show the trap metric live, then split into TPR/TNR and watch the judge look terrible [1].
  3. Binary vs Likert — let two attendees grade the same output on 1-5 and disagree, then re-grade binary and converge [1].
  4. Rubric edit → score flip — tighten one judge sentence, re-run, watch a borderline case flip; evals are code you debug [15].
  5. Spot-check the judge — sample 5-10% of verdicts against human calls to keep it honest [10].
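
The spot-check in item 5 is easiest to keep honest when the sample is drawn by code rather than by hand. A minimal sketch, assuming verdicts are identified by integer ids; the function name and the seed are hypothetical.

```python
import math
import random


def spot_check_sample(verdict_ids, fraction=0.1, seed=0):
    """Pick a reproducible random subset of judge verdicts for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(verdict_ids)
    if not ids:
        return []
    k = max(1, math.ceil(len(ids) * fraction))
    # A seeded Random instance gives the same sample on every run, so two
    # reviewers can label the same verdicts independently.
    return sorted(random.Random(seed).sample(ids, k))


print(spot_check_sample(range(40), fraction=0.1))
```

Feeding the human labels for that sample back into the TPR/TNR check closes the loop on judge drift.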

Workshop vs talk: the tradeoff

A workshop’s value is hands-on muscle memory; a talk’s is reach and zero setup risk [13].

| Factor | Hands-on workshop | Talk-only variant |
|---|---|---|
| What sticks | Muscle memory: they’ve built a gate | Mental model + a repo to try later |
| Audience cap | ≤ ~30 (support per person) [13] | Unbounded |
| Failure mode | ⚠ Setup/laptop/key chaos eats time [13] | Passive; no “I did it” moment |
| Prep cost | High: devcontainer, keys, per-attendee repo | Low: slides + screen-recorded lab |
| Best when | Internal team, paid client, small cohort | Conference keynote, large meetup |

Talk variant (45-60 min). Same spine, you drive: cold-open the red CI, then rewind and narrate building golden set → judge → alignment → gate using pre-recorded terminal clips (live judge calls are slow and flaky on stage). End by handing out the repo so attendees run CP1-4 themselves. This is also your fallback if the hands-on room melts down — switch to driving from your machine and keep the payoff intact.

Hybrid (recommended default for Itenium): demo-driven talk with two short “you try it” beats (write the judge rubric; push the regression). Captures most of the muscle-memory value while bounding setup risk — and degrades gracefully to a pure talk if the room’s environments fail.


scout: standard depth, single pass. DeepEval star count via GitHub API on 2026-06-04.
