Atlas survey

Live Regression Demo Design

How to design a 3-minute live regression demo that converts skeptics: the right scenario, promptfoo as the demo tool, cached responses for reliability, and scripted keypresses via Demo Time.

15 sources ~5 min read #206 evals · live-demo · promptfoo · regression-testing · conference-talk · llm · demo-design

Decision Use promptfoo ⭐ 22k [4] with response caching left on (promptfoo caches provider responses by default; --no-cache is the flag that turns it off). One scenario: a support-agent prompt edit that silently collapses refusal behavior, invisible to eyeballing but visible in the eval table. Three acts: manual “looks good” → quiet prompt edit → eval catches it. Record a 60-second fallback video before you go on stage. [8]

The right scenario

The demo has one job: show that eyeballing a single response doesn’t catch what evals catch across the suite. Pick a scenario where the regression is invisible per-response but visible as a pattern. [14]

The canonical example from the FutureAGI regression playbook [1]: a support agent that handles refund requests. You add one line to the system prompt — "Always respond in a warm, conversational tone" — and every individual response looks friendlier. But across a 15-case golden set, refusal rate on legitimate refund requests climbs 14 points. [2] Overall metrics may even improve (e.g. 0.91→0.93) while a specific cohort collapses (0.94→0.83). The eval catches this in 90 seconds; manual review never would.
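The masking effect is plain arithmetic, and a small sketch makes it concrete. The suite size (1,000 cases) and cohort size (100) below are assumptions, chosen only so the pass rates reproduce the figures quoted above; they are not taken from the playbook.

```python
# Hypothetical counts, chosen so the rates match the figures quoted above.
# "cohort" = legitimate refund requests; "rest" = every other case.
# Each entry is (cases passed, cases total).
RUNS = {
    "v1": {"cohort": (94, 100), "rest": (816, 900)},
    "v2": {"cohort": (83, 100), "rest": (847, 900)},
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate rounded to two decimals, as it would appear on a dashboard."""
    return round(passed / total, 2)


def summarize(run: dict) -> dict:
    """Return the headline (overall) rate next to the cohort rate."""
    passed = sum(p for p, _ in run.values())
    total = sum(t for _, t in run.values())
    return {
        "overall": pass_rate(passed, total),
        "cohort": pass_rate(*run["cohort"]),
    }


summary = {name: summarize(run) for name, run in RUNS.items()}
for name, stats in summary.items():
    print(name, stats)
```

Because the cohort is only a tenth of the suite in this sketch, a small gain across the other 900 cases outweighs an 11-point drop inside it, which is why the headline number alone is not a safe release gate.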

This works because it passes the “I would have done that” test: every developer in the room has shipped a well-intentioned wording change that silently altered behavior elsewhere. The non-determinism and prompt sensitivity that make LLM regression testing hard [3] are exactly what make the demo convincing.

Prepare the following artefacts:

  • v1-prompt.txt — original system prompt
  • v2-warm-tone.txt — one line added: "Always respond in a warm, conversational tone."
  • promptfooconfig.yaml — 12–15 test cases: ~60% happy-path refunds, ~25% legitimate refund requests, ~15% edge-case phrasing [1]
  • Assertions: llm-rubric for tone quality; icontains for refusal detection

Keep the demo suite at 12–15 cases rather than the full golden set. On stage the scarce resource is time, not coverage.
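To show what a case-insensitive substring check for refusals measures across a suite, here is a plain-Python stand-in. It is not promptfoo's implementation; the marker phrases and sample outputs are hypothetical and exist only to illustrate the pattern the eval table surfaces.

```python
# Toy stand-in for a case-insensitive "contains" assertion.
# Marker phrases and sample outputs are hypothetical, not from a real run.
REFUSAL_MARKERS = ("unable to process", "cannot offer a refund")


def looks_like_refusal(output: str) -> bool:
    """True if any refusal marker appears, ignoring case."""
    text = output.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(outputs: list[str]) -> float:
    """Share of outputs flagged as refusals."""
    return sum(looks_like_refusal(o) for o in outputs) / len(outputs)


v1_outputs = [
    "Your refund has been issued.",
    "Refund approved; expect it in 3-5 days.",
    "I've processed the refund for order 1001.",
    "Sorry, we cannot offer a refund on this order.",
]
v2_outputs = [
    "So happy to help! Your refund has been issued.",
    "Oh no, I'm so sorry! Sadly we cannot offer a refund this time.",
    "Thanks for reaching out! Unfortunately I'm unable to process that.",
    "I totally understand! Sadly we cannot offer a refund here.",
]
print(refusal_rate(v1_outputs), refusal_rate(v2_outputs))
```

Each v2 reply reads as friendlier when viewed alone; only the aggregate rate exposes the shift, which is the point the demo has to land.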

The 3-act arc

Act 1: The vibe check (30 sec) Show a terminal: curl the agent with a refund request. The response comes back. “Looks good, ship it.” Slide title: “This is how most teams test today.” The audience recognises it. [14]

Act 2: The innocent edit (20 sec) Show git diff --no-index v1-prompt.txt v2-warm-tone.txt (inside a repository, --no-index makes git compare the two files directly instead of treating them as pathspecs). One line added. Ask the room: “Would you approve this PR?” Most will nod. [1]

Act 3: The eval catches it (2 min) Run promptfoo eval --config promptfooconfig.yaml. The pass/fail table populates live in the terminal. Red cells appear on the refund-rejection cases. Run promptfoo view: the browser opens a scatter plot where v2 pulls left (lower score) on the refusal rubric. [5] [6]

The punchline, said aloud: “The single-response curl looked fine. Fourteen cases disagreed.” That’s the talk’s thesis, demonstrated.

Tool: promptfoo for demo work

promptfoo ⭐ 22k [4] wins for live demo use over DeepEval and Langfuse for three concrete reasons [15]:

Reason | Detail
Terminal output | Pass/fail table prints live as tests run, so the audience sees it build in real time
YAML config | promptfooconfig.yaml is readable when projected; pytest files are not
Web viewer | promptfoo view opens a browser table and scatter plot in one command [5]

The promptfoo-demo-evals ⭐ 0 repo [7] provides a language-learning guided use case — useful as a structural reference even if you don’t run it verbatim.

DeepEval is better if your team is Python-native and wants assert statements in pytest. But projecting pytest output live is awkward; the failure messages are verbose and test IDs are opaque to an audience unfamiliar with the codebase. [15]

Keeping the demo safe

Do not rely on temperature=0. It governs only the token-selection rule, not floating-point inference on parallel hardware. Two calls to the same model with the same prompt can diverge even at temp=0, especially across Mixture-of-Experts routing. [12]
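The floating-point part of that claim is easy to show in isolation. The sketch below is a toy, not a model: it sums the same three numbers in two orders (as parallel hardware may do) and then applies a greedy pick against a hypothetical near-tied rival logit.

```python
# Toy illustration: the same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False: float addition is not associative


def greedy_pick(logit_a: float, logit_b: float) -> str:
    """Greedy (temperature-0-style) decoding: take the strictly larger logit."""
    return "A" if logit_a > logit_b else "B"


rival_logit = 0.6  # hypothetical near-tied competitor token
print(greedy_pick(left_to_right, rival_logit))  # A
print(greedy_pick(right_to_left, rival_logit))  # B
```

A difference in the last bit is enough to flip the chosen token when two candidates are nearly tied, and every later token then conditions on a different prefix.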

Use a layered safety strategy instead:

  1. Cache first: promptfoo caches provider responses on disk by default, keyed to (prompt + model + params), so a plain promptfoo eval reuses them as long as you do not pass --no-cache. First run hits the API; every subsequent run (rehearsal, live stage) is instant and identical. [13]
  2. Local model fallback — wire the config to Ollama for fully offline determinism. Smaller model = weaker rubric quality, but demo consistency beats live variance. [4]
  3. Pre-recorded backup — record a 60-second terminal capture of the demo working. Load it in a video player on a hidden virtual desktop. Switch to it silently if the live run hangs or the API rate-limits on conference Wi-Fi. [8]
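The first layer rests on one idea: identical (prompt, model, params) inputs map to one stored response. The sketch below illustrates that idea in a few lines; it is not promptfoo's cache implementation, and CachedProvider, fake_api, and the model name are hypothetical.

```python
import hashlib
import json
from typing import Callable


def cache_key(prompt: str, model: str, params: dict) -> str:
    """Stable key: same prompt + model + params always hash to the same value."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedProvider:
    """Wraps an API call so repeated identical requests never leave the machine."""

    def __init__(self, call_api: Callable[[str, str, dict], str]):
        self._call_api = call_api
        self._store: dict[str, str] = {}
        self.api_calls = 0

    def complete(self, prompt: str, model: str, params: dict) -> str:
        key = cache_key(prompt, model, params)
        if key not in self._store:
            self.api_calls += 1  # only a cache miss reaches the API
            self._store[key] = self._call_api(prompt, model, params)
        return self._store[key]


def fake_api(prompt: str, model: str, params: dict) -> str:
    # Hypothetical stand-in for a real network call.
    return f"[{model}] reply to: {prompt}"


provider = CachedProvider(fake_api)
first = provider.complete("Refund order 1001?", "demo-model", {"temperature": 0})
second = provider.complete("Refund order 1001?", "demo-model", {"temperature": 0})
print(first == second, provider.api_calls)
```

The practical consequence for rehearsal: any change to the prompt text, model, or parameters produces a new key and therefore a live call, so freeze all three before warming the cache.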

API key hygiene: Set OPENAI_API_KEY in your shell session before the talk, not inline in the config. Keep a spare key in a .env file. Conference Wi-Fi sometimes rate-limits unfamiliar source IPs.
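A short preflight script can confirm these conditions before you walk on stage. This is a minimal sketch under stated assumptions: the preflight helper is hypothetical, and the file names are the artefacts listed earlier, assumed to sit in the current directory.

```python
import os
from pathlib import Path

# Artefact names from the preparation list; their location is an assumption.
REQUIRED_FILES = ("v1-prompt.txt", "v2-warm-tone.txt", "promptfooconfig.yaml")


def preflight(env: dict, demo_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the demo can start."""
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set in this shell")
    for name in REQUIRED_FILES:
        if not (demo_dir / name).is_file():
            problems.append(f"missing file: {name}")
    return problems


if __name__ == "__main__":
    for problem in preflight(dict(os.environ), Path.cwd()):
        print("FIX BEFORE STAGE:", problem)
```

Running it from the same shell session you will present from matters, since an exported key does not carry over to a freshly opened terminal.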

Presentation mechanics

Demo Time [9] is a VS Code extension that scripts every step as a keypress — pre-define each file edit, terminal command, and highlight; execute with a single key. No typos, no forgotten flags, no context switch between slides and terminal. [10] Used at Microsoft Ignite and OpenAI DevDay. [11]

Configure the environment before going on stage:

  • Light theme (dark terminals wash out on conference projectors)
  • Terminal font size 28–32pt
  • Block cursor (thin blinking cursors disappear at distance)
  • Hide tabs, file tree, status bar
  • Close all other terminal windows and browser tabs

Rehearse the full demo at least 10 times. Run it once on a clean machine with a fresh shell. Run it once with Wi-Fi disabled to confirm the cache path works. [8]

Time budget

Segment | Time | Notes
Act 1: vibe check | 30 sec | Single curl, “looks fine”
Act 2: the edit | 20 sec | git diff, one-line change
Act 3: eval runs | 90 sec | Table populates, promptfoo view opens
Commentary/pause | 40 sec | “One rubric. Fourteen cases. Ship it?”
Total | ~3 min | Target for a 30-min talk

Over-rehearse so you can cut Act 1 if you’re running long. Never cut Act 3 — the pass/fail table is the talk’s evidence.
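As a quick check on the budget, the segment lengths from the table can be summed, along with the time that remains if Act 1 is dropped. The numbers are copied from the table; only the variable names are new.

```python
# Segment lengths in seconds, copied from the time-budget table.
BUDGET = {
    "Act 1: vibe check": 30,
    "Act 2: the edit": 20,
    "Act 3: eval runs": 90,
    "Commentary/pause": 40,
}
TARGET_SECONDS = 3 * 60

total = sum(BUDGET.values())
without_act_1 = total - BUDGET["Act 1: vibe check"]
print(total, without_act_1)
```

The segments fill the three-minute target exactly, so there is no slack built in; cutting Act 1 is the only lever that buys time back without touching the eval run.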
