Decision
Use promptfoo ⭐ 22k [4] with response caching enabled. One scenario: a support-agent prompt edit that silently breaks refusal behavior, invisible to eyeballing but visible in the eval table. Three acts: manual “looks good” → quiet prompt edit → eval catches it. Record a 60-second fallback video before you go on stage. [8]
The right scenario
The demo has one job: show that eyeballing a single response doesn’t catch what evals catch across the suite. Pick a scenario where the regression is invisible per-response but visible as a pattern. [14]
The canonical example from the FutureAGI regression playbook [1]: a support agent that handles refund requests. You add one line to the system prompt — "Always respond in a warm, conversational tone" — and every individual response looks friendlier. But across a 15-case golden set, refusal rate on legitimate refund requests climbs 14 points. [2] Overall metrics may even improve (e.g. 0.91→0.93) while a specific cohort collapses (0.94→0.83). The eval catches this in 90 seconds; manual review never would.
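The aggregate-versus-cohort effect is plain arithmetic, and a short sketch makes it concrete. The counts below are invented solely to reproduce the ratios quoted above (they are not taken from the playbook), and cohort_rates is a hypothetical helper name.

```shell
# Print per-cohort and overall pass rates for two prompt versions.
# Input lines: cohort,v1_passes,v2_passes,total_cases
# All counts are illustrative, chosen only to match the ratios in the text.
cohort_rates() {
  awk -F, '
    {
      v1 += $2; v2 += $3; n += $4
      printf "%-16s v1=%.2f  v2=%.2f\n", $1, $2 / $4, $3 / $4
    }
    END { printf "%-16s v1=%.2f  v2=%.2f\n", "overall", v1 / n, v2 / n }
  '
}

cohort_rates <<'EOF'
refund_requests,94,83,100
everything_else,270,289,300
EOF
```

With these counts the overall rate moves from 0.91 to 0.93 while refund_requests drops from 0.94 to 0.83, so a single headline score would report an improvement.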
This works because it passes the “I would have done that” test — every developer in the room has shipped a well-intentioned wording change that silently altered behavior elsewhere. The non-determinism and prompt sensitivity that make LLM regression testing hard [3] are exactly what makes the demo convincing.
Prepare the following artefacts:
- v1-prompt.txt: the original system prompt
- v2-warm-tone.txt: v1 plus one added line, "Always respond in a warm, conversational tone."
- promptfooconfig.yaml: 12–15 test cases (~60% happy-path refunds, ~25% legitimate refund requests, ~15% edge-case phrasing) [1]
- Assertions: llm-rubric for tone quality; icontains for refusal detection
Keep the test suite at 12–15 cases (not the full golden set). The constraint is demo time, not coverage.
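A minimal sketch of how the files might fit together, written from the shell so the setup is repeatable. The provider ID, the {{message}} variable, the test wording, and the refusal phrase are all assumptions for illustration, and the sketch assumes both prompt files end with a {{message}} placeholder. Refusal detection here uses not-icontains, the negated form of icontains, so a case fails when the refusal phrase appears.

```shell
# Write a three-case starter config; extend to 12-15 cases for the demo.
# Placeholders (not from the source): provider, variable name, test text,
# and the refusal phrase being matched.
cat > promptfooconfig.yaml <<'EOF'
description: "Refund agent: v1 vs v2 warm tone"

prompts:
  - file://v1-prompt.txt
  - file://v2-warm-tone.txt

providers:
  - openai:gpt-4o-mini

defaultTest:
  assert:
    # Tone quality, graded by a model
    - type: llm-rubric
      value: "The reply is warm and conversational without being evasive."
    # Refusal detection: fail if the agent declines the request
    - type: not-icontains
      value: "unable to process"

tests:
  - description: happy-path refund
    vars:
      message: "My order arrived damaged yesterday. Can I get a refund?"
  - description: legitimate refund request
    vars:
      message: "I was charged twice for one order. Please refund the duplicate."
  - description: edge-case phrasing
    vars:
      message: "so uh, money back? the thing never showed up"
EOF
```

Listing both prompt files under prompts is what produces the side-by-side v1/v2 columns in the results table. Putting the assertions in defaultTest keeps each test case to two or three lines, which matters when the file is projected.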
The 3-act arc
Act 1 — The vibe check (30 sec)
Show a terminal: curl the agent with a refund request. Response comes back. “Looks good, ship it.” Slide title: “This is how most teams test today.” The audience recognises it. [14]
Act 2 — The innocent edit (20 sec)
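Show git diff --no-index v1-prompt.txt v2-warm-tone.txt. The --no-index flag makes git compare the two files against each other; inside a repository, plain git diff with two paths treats them as path filters instead. One line added. Ask the room: “Would you approve this PR?” Most will nod. [1]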
Act 3 — The eval catches it (2 min)
Run promptfoo eval --config promptfooconfig.yaml. The pass/fail table populates live in the terminal. Red cells appear on the refund-rejection cases. Then run promptfoo view: the browser opens a scatter plot where v2 pulls left (lower score) on the refusal rubric. [5] [6]
The punchline, said aloud: “The single-response curl looked fine. Fourteen cases disagreed.” That’s the talk’s thesis, demonstrated.
Tool: promptfoo for demo work
promptfoo ⭐ 22k [4] wins for live demo use over DeepEval and Langfuse for three concrete reasons [15]:
| Reason | Detail |
|---|---|
| Terminal output | Pass/fail table prints live as tests run, so the audience sees it build in real time |
| YAML config | promptfooconfig.yaml is readable when projected; pytest files are not |
| Web viewer | promptfoo view opens browser table + scatter plot in one command [5] |
The promptfoo-demo-evals ⭐ 0 repo [7] provides a language-learning guided use case — useful as a structural reference even if you don’t run it verbatim.
DeepEval is better if your team is Python-native and wants assert statements in pytest. But projecting pytest output live is awkward; the failure messages are verbose and test IDs are opaque to an audience unfamiliar with the codebase. [15]
Keeping the demo safe
Do not rely on temperature=0. It fixes only the token-selection rule; it does not remove floating-point variation in inference on parallel hardware. Two calls to the same model with the same prompt can diverge even at temperature=0, especially with Mixture-of-Experts routing. [12]
Use a layered safety strategy instead:
- Cache first: promptfoo caches provider responses on disk by default (promptfoo eval --no-cache turns this off), keyed to (prompt + model + params). The first run hits the API; every subsequent run (rehearsal, live stage) is instant and identical. [13]
- Local model fallback: wire the config to Ollama for a fully offline run. A smaller model means weaker rubric quality, but demo consistency beats live variance. [4]
- Pre-recorded backup: record a 60-second terminal capture of the demo working. Load it in a video player on a hidden virtual desktop. Switch to it silently if the live run hangs or the API rate-limits on conference Wi-Fi. [8]
API key hygiene: Set OPENAI_API_KEY in your shell session before the talk, not inline in the config. Keep a spare key in a .env file. Conference Wi-Fi sometimes rate-limits unfamiliar source IPs.
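The three layers can be chained in a small wrapper, sketched below under stated assumptions. The function names are hypothetical, the Ollama model name is a placeholder, and the sketch assumes that promptfoo exits with code 100 when a run finishes with failing assertions and that --providers overrides the providers in the config; check both against your installed version.

```shell
# Hypothetical helpers for the layered fallback; nothing runs until
# demo_with_fallbacks is called.

# Assumption: promptfoo exits 0 when every case passes and 100 when the run
# finished but some assertions failed (expected for v2 in this demo).
eval_completed() {
  [ "$1" -eq 0 ] || [ "$1" -eq 100 ]
}

run_demo_eval() {
  promptfoo eval --config promptfooconfig.yaml "$@"
  eval_completed "$?"
}

demo_with_fallbacks() {
  # Layer 1: default provider, answered from the cache after the first run.
  run_demo_eval && return 0
  # Layer 2: local model via Ollama (model name is a placeholder).
  run_demo_eval --providers ollama:chat:llama3 && return 0
  # Layer 3: hand over to the recording.
  echo "Eval did not complete; switch to the recorded video." >&2
  return 1
}
```

Treating exit code 100 as success matters here: the demo expects v2 to fail some cases, so a plain || chain would trigger the fallback on exactly the run you want to show.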
Presentation mechanics
Demo Time [9] is a VS Code extension that scripts every step as a keypress — pre-define each file edit, terminal command, and highlight; execute with a single key. No typos, no forgotten flags, no context switch between slides and terminal. [10] Used at Microsoft Ignite and OpenAI DevDay. [11]
Configure the environment before going on stage:
- Light theme (dark terminals wash out on conference projectors)
- Terminal font size 28–32pt
- Block cursor (thin blinking cursors disappear at distance)
- Hide tabs, file tree, status bar
- Close all other terminal windows and browser tabs
Rehearse the full demo at least 10 times. Run it once on a clean machine with a fresh shell. Run it once with Wi-Fi disabled to confirm the cache path works. [8]
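A fresh-shell rehearsal is easier to trust with a quick check of the pieces the demo depends on. The following is a sketch only: preflight is a hypothetical helper, and it covers just the files and the environment variable named in this section, not the Wi-Fi-off cache rehearsal.

```shell
# Hypothetical pre-stage check: reports every missing piece and returns
# non-zero if anything is missing.
preflight() {
  missing=0
  for f in v1-prompt.txt v2-warm-tone.txt promptfooconfig.yaml; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f" >&2
      missing=$((missing + 1))
    fi
  done
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set in this shell" >&2
    missing=$((missing + 1))
  fi
  if ! command -v promptfoo >/dev/null 2>&1; then
    echo "promptfoo is not on PATH" >&2
    missing=$((missing + 1))
  fi
  [ "$missing" -eq 0 ]
}
```

Run preflight from the demo directory in the same terminal you will project; it prints one line per problem and stays silent when everything is in place.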
Time budget
| Segment | Time | Notes |
|---|---|---|
| Act 1: vibe check | 30 sec | Single curl, “looks fine” |
| Act 2: the edit | 20 sec | git diff, one-line change |
| Act 3: eval runs | 90 sec | Table populates, promptfoo view opens |
| Commentary/pause | 40 sec | “One rubric. Fourteen cases. Ship it?” |
| Total | ~3 min | Target for a 30-min talk |
Over-rehearse so you can cut Act 1 if you’re running long. Never cut Act 3 — the pass/fail table is the talk’s evidence.