The expedition landed on two facts that should be in the facilitator’s opening frame, not buried in slides.
First, the productivity numbers are messier than vendor decks suggest. GitHub Copilot’s canonical RCT [1] showed a 55.8% speedup on a bounded toy task. METR’s RCT [2] ran experienced developers on their own real codebases using Cursor Pro — and measured a 19% slowdown. The same developers estimated they were 20% faster [3]. That self-perception gap is the workshop’s opening premise: participants are not here to be sold AI tooling; they are here to learn where structure makes AI assistance reliable instead of merely fast-feeling.
Second, AI test generators are not designed to catch bugs — they are designed to pass. Evaluation of CodiumAI CoverAgent and CoverUp found generated tests cannot detect existing bugs, actively validate faulty implementations, and structurally filter out tests that would expose bugs [4]. Copilot /tests with no seed tests produces a 92.45% failure-or-empty rate [5]. This is the vibe-coding failure mode in test clothing: AI produces coverage theater, not quality signal.
Where all four angles converge. The single most load-bearing constraint across the evidence, exercises, tooling, and facilitation pages is seven words: “You may not modify the test file.” This redirects the model’s optimization from “make it green by any means” to “write a correct implementation.” It belongs on exercise sheets, in .github/copilot-instructions.md, and in the facilitator’s opening frame. Without it, experienced developers will discover the cheat in under ten minutes and dismiss the entire loop.
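The constraint above is only as strong as its enforcement, so a facilitator may want a mechanical check alongside the written rule. The following is a minimal sketch, not part of any tool named on these pages: it fingerprints the test files before the AI session and reports any that changed afterwards. The function names `fingerprint` and `modified_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return {path: sha256 hex digest} for every listed test file that exists."""
    return {
        str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
        for p in paths
        if Path(p).exists()
    }


def modified_files(before, after):
    """List files whose digest changed, plus files that appeared or disappeared."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))
```

In use, the facilitator would call `fingerprint` on the exercise's test files before participants start, call it again after the AI-assisted implementation, and treat a non-empty result from `modified_files` as a violation of the rule.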
The counterintuitive TDAD finding. The 2026 TDAD paper [6] found that adding standard TDD procedural instructions to a coding agent’s prompt, without a dependency context map, raised the regression rate from a 6.08% baseline to 9.94%: the agent did worse than with no intervention at all. With the context map, regressions dropped to 1.82%. The implication for the workshop frame: the goal is not to teach agents to follow TDD. It is to teach developers to structure AI work so tests serve as executable specifications. Procedure without context is worse than nothing; this finding should preempt any participant who asks “can’t we just put TDD rules in the system prompt?”
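To make the idea of a dependency context map concrete, here is a deliberately simplified sketch. It is not the TDAD paper's implementation: it assumes the map can be approximated by direct, absolute Python imports in each test file, and every function name in it is hypothetical. It inverts "test imports module" into "module is exercised by these tests" and renders one line of context for an agent prompt.

```python
import ast
from collections import defaultdict


def imported_modules(source):
    """Collect top-level names of absolutely imported modules in Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def build_context_map(test_sources):
    """Invert {test file: source} into {module: sorted tests that import it}."""
    context = defaultdict(set)
    for test_name, source in test_sources.items():
        for module in imported_modules(source):
            context[module].add(test_name)
    return {module: sorted(tests) for module, tests in context.items()}


def render_for_prompt(context, changed_module):
    """Produce one line of test-impact context for the module being edited."""
    tests = context.get(changed_module, [])
    if not tests:
        return f"No known tests depend on {changed_module}."
    return f"Changing {changed_module} affects: {', '.join(tests)}."
```

A real map would also need transitive dependencies and non-import coupling, which this sketch ignores. The point is the shape of the context: the agent is told which tests a change can break, rather than being told to "follow TDD".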
The vibe-coding contrast is the anchor exercise. The exercises page recommends a 10-minute anti-pattern demo: implement a feature without tests via AI, add a second feature, and observe architectural degradation in real time. Without test constraints, AI coding agents never spontaneously suggest refactoring [7]. This is the sharpest contrast with the vibe-coding workshop for non-technical audiences: expert developers internalize it from a live diff, no argument required.
Tooling drives exercise choice, not vice versa. Cursor’s YOLO mode [8] runs the full test loop autonomously — useful for facilitated demos, potentially counterproductive for hands-on learning (participants watch rather than drive). The facilitation plan resolves this tension: Cursor for demos, Copilot or Continue.dev for participant exercises. Exercises pre-populate failing tests; participants implement with AI. This pattern eliminates the need for participants to write good test specs under time pressure while enforcing the red step structurally.
What the TDAD dependency-map approach means in practice. As of mid-2026, no workshop-friendly tool ships test-impact context maps out of the box. Participants will leave knowing that the highest-leverage agentic TDD pattern [6] exists, but they will not be able to apply it in their IDE without custom tooling. That may be the most honest outcome a 90-minute session can produce, and naming it explicitly in the closing segment is more durable than pretending the field has solved agentic TDD.