AI-Assisted TDD Workshop Playbook

expedition · 79 citations · Sonnet 4.6 · 2026-06-03 · facilitator ops dashboard

⏱ 90 min session 🧪 60–70% hands-on 👥 3 facilitation roles ⚡ 4 sub-research areas 📚 79 citations

critical constraint — embed in exercise sheets, .github/copilot-instructions.md, and opening frame

⛔

"You may not modify the test file."

Without this rule, the AI can turn the suite green by editing the tests; expert developers find that cheat in under 10 minutes and dismiss the entire loop. The constraint redirects AI optimization from "make it green by any means" to "write a correct implementation." Procedural TDD instructions alone do not help: adding them without a dependency context map raised the regression rate from 6.08% to 9.94%, worse than giving no TDD instruction at all. ^[6]
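
One way to back the rule with tooling is to check the changed paths in a participant's diff. The sketch below is a minimal illustration, not part of the playbook: the function name and the glob patterns are hypothetical and should be adapted to the workshop repo's layout.

```python
from fnmatch import fnmatch

# Hypothetical patterns; adjust to where the workshop repo keeps its tests.
PROTECTED_PATTERNS = ("tests/*", "*.test.ts", "*_test.py")


def modified_test_files(changed_paths, patterns=PROTECTED_PATTERNS):
    """Return the changed paths that match a protected test-file pattern."""
    return [
        path
        for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]


# Paths as they might come from `git diff --name-only`.
changed = ["src/calculator.ts", "src/calculator.test.ts"]
violations = modified_test_files(changed)
if violations:
    print("Test files were modified:", ", ".join(violations))
```

Run as a CI step or during the Validate review, a non-empty result is the signal to reject the change rather than debate it.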

session timeline — 90 minutes

00:00 Check-in

00:10 Context Frame

00:20 Exercise Block 1 — 35 min

00:55 ⏸ Break (5 min)

01:00 Exercise Block 2 — 30 min

01:30 Debrief

01:40 Takeaways

Setup / Frame (20 min) · Hands-on exercises (65 min) · Break (non-negotiable) · Synthesis (20 min)

opening frame evidence — name the gap before the first exercise

+55.8%

GitHub Copilot RCT speedup
bounded toy task

[1] arxiv · 2023

−19%

METR RCT: experienced devs were 19% slower
on their own codebases, Cursor Pro

[2] metr.org · 2025

+20%

What those devs estimated —
the self-perception gap

[3] arxiv · 2025

92%

Copilot /tests failure-or-empty rate
with no seed tests

[5] testomat.io · 2025

9.94%

Regression rate: TDD instructions
without context map (up from 6.08%)

[6] TDAD · arxiv 2026

1.82%

Regression rate: TDD instructions
plus dependency context map

[6] TDAD · arxiv 2026
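
For the opening frame it can help to state the regression figures as relative changes. The arithmetic below is a small sketch that assumes 6.08% is the no-instruction baseline for both comparisons, as the figures above imply, and treats the quoted rates as point estimates.

```python
def relative_change(baseline, observed):
    """Relative change of `observed` against `baseline`, as a fraction."""
    return (observed - baseline) / baseline


# Regression rates quoted above, in percent.
BASELINE = 6.08       # no TDD instruction
TDD_ONLY = 9.94       # TDD instructions, no context map
TDD_WITH_MAP = 1.82   # TDD instructions plus dependency context map

print(f"TDD only:     {relative_change(BASELINE, TDD_ONLY):+.1%}")      # +63.5%
print(f"TDD with map: {relative_change(BASELINE, TDD_WITH_MAP):+.1%}")  # -70.1%
```

Under that assumption, instructions alone raise regressions by roughly two thirds, while instructions plus the context map cut them by about 70%.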

exercise catalog — warm-up + main exercise

String Calculator

Greenfield TDD · run twice: baseline then AI to show delta

15–20 min

warm-up
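
A possible end state for the first few cycles of the warm-up is sketched below. The rules used here (an empty string sums to 0, comma-separated integers are added) are the commonly used opening steps of the kata and are an assumption, since the playbook does not spell them out.

```python
def add(numbers: str) -> int:
    """Sum a comma-separated string of integers; an empty string sums to 0."""
    if not numbers:
        return 0
    return sum(int(part) for part in numbers.split(","))


# In the session each test is written first (Red), then `add` is extended
# just enough to pass it (Green).
def test_empty_string_is_zero():
    assert add("") == 0


def test_single_number():
    assert add("7") == 7


def test_two_numbers():
    assert add("1,2") == 3
```

The test functions follow the pytest naming convention, so a test runner such as pytest would collect them without extra wiring.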

Tetris Skeleton

Pre-written failing tests · implement with AI

45–60 min

medium

Goose Game ⭐ 3

Kotlin · prompts/ log → concrete debrief material

45–60 min

medium

Gilded Rose

Legacy refactoring · characterization tests before touching logic

40–60 min

med-hard
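
Characterization tests can be sketched as a recorded "golden master": capture what the legacy code does today across a grid of inputs, then compare every refactored version against that record. The `legacy_update` function below is a hypothetical stand-in, not the real Gilded Rose rules, and the helper names are illustrative.

```python
import itertools


def legacy_update(name, sell_in, quality):
    """Hypothetical stand-in for legacy logic (not the real kata rules)."""
    if name == "Aged Brie":
        quality = min(50, quality + 1)
    else:
        quality = max(0, quality - (2 if sell_in <= 0 else 1))
    return sell_in - 1, quality


def record_behaviour(func, names, sell_ins, qualities):
    """Capture current outputs for every input combination."""
    return {
        args: func(*args)
        for args in itertools.product(names, sell_ins, qualities)
    }


def differences(golden, func):
    """Return the inputs whose output under `func` differs from the record."""
    return [args for args, expected in golden.items() if func(*args) != expected]


# Record once, before anyone (human or AI) touches the logic.
GOLDEN = record_behaviour(
    legacy_update, ["Aged Brie", "Elixir"], [-1, 0, 5], [0, 25, 50]
)
```

After each AI-assisted refactoring step, `differences(GOLDEN, refactored_update)` should return an empty list; any entry names an input whose behaviour changed.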

Trip Service

Dependency breaking · AI alone cannot reliably solve seam design

45–60 min

hard
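
To make "seam design" concrete for the debrief, the sketch below shows one possible target shape: the session lookup and the trip query are injected so tests can substitute fakes. The rules assumed here (trips are visible only to friends, and an error is raised when nobody is logged in) follow the commonly circulated version of the kata; all names are hypothetical.

```python
from dataclasses import dataclass, field


class UserNotLoggedInError(Exception):
    """Raised when no user is logged in."""


@dataclass(eq=False)  # compare users by identity, not by field values
class User:
    name: str
    friends: list = field(default_factory=list)


class TripService:
    """Collaborators arrive through the constructor instead of global state."""

    def __init__(self, get_logged_user, find_trips_by_user):
        self._get_logged_user = get_logged_user        # seam 1: session lookup
        self._find_trips_by_user = find_trips_by_user  # seam 2: data access

    def trips_by_user(self, user):
        logged_user = self._get_logged_user()
        if logged_user is None:
            raise UserNotLoggedInError()
        if logged_user in user.friends:
            return self._find_trips_by_user(user)
        return []
```

Choosing which dependencies become seams, and in what order to break them, is the design judgment the exercise leaves to the participant.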

EXACT Mini-project

Example Mapping → AI-TDD synthesis · all three autonomy levels

60–90 min

expert

⚠ Anti-pattern demo · 10–15 min

Vibe-code a feature without tests → add a second feature → watch the architecture degrade live. Without test constraints, AI agents never spontaneously suggest refactoring. ^[7] Show the vibe-coded diff alongside a TDD diff; expert devs take the point without argument.

facilitation roles — never let the Lead touch Zoom controls

🎙 LEAD required

Delivers content, runs exercises, timeboxes discussions
Names AI limits explicitly in context frame
Calls the break at 00:55 — hard stop
Never also operates Zoom/Meet controls

🎛 PRODUCER required

Manages polls, breakouts, recordings, visible countdown timer
Watches chat; voices questions to Lead without interrupting
Posts exercise instructions in chat (verbal-only gets missed)
Maintains parking lot board (Miro / FigJam)

🛠 HELPER(S) 1 per room of 4–6

Joins each breakout room first 3 min; confirms exercise loaded
Demos on own screen — never takes over participant's keyboard
Only needs to know one exercise; comfort beats completeness

✓ what works

Peer credibility framing
Explicit sandbox safety
First win in < 5 min
Mixed-sceptic rooms
Silent brainstorm before open floor

✗ what fails

"This is the new standard"
Skipping AI limits discussion
All-sceptics or all-enthusiasts rooms
Lecture-mode > 15 min

pre-workshop logistics — complete before session day

✓

Environment — GitHub Codespaces

Commit .devcontainer/devcontainer.json; enable prebuilds in the repository's Codespaces settings
Language runtime + test framework (e.g. Node 22 + Vitest)
AI extension pre-authenticated inside the container
Skeleton repo: failing tests present, implementation stubs empty
Reference solution on separate branch (unblocks without spoiling)
CI on every push — instant green/red signal
⚠ Free tier: 120 core-hours/month (60 hrs on a 2-core machine); provide credit vouchers

API Keys — Per-Participant

Hard budget cap: $2–5 per key
Expiry: session day + 24 hrs
Model allowlist: workshop model only
Email distribution link 48 hrs out with curl test snippet
Claims window: opens 1 hr before start
Shared key = security risk for >10 participants
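
The timing rules above can be derived from the session start time. This is a minimal sketch that assumes "session day + 24 hrs" is measured from the session start; the function and key names are hypothetical.

```python
from datetime import datetime, timedelta


def key_schedule(session_start: datetime) -> dict:
    """Derive key-provisioning timestamps from the session start time."""
    return {
        "email_link_at": session_start - timedelta(hours=48),
        "claims_open_at": session_start - timedelta(hours=1),
        # Assumption: expiry counts 24 hours from the session start.
        "expires_at": session_start + timedelta(hours=24),
    }


# Hypothetical session date, for illustration only.
schedule = key_schedule(datetime(2026, 9, 15, 10, 0))
for label, moment in schedule.items():
    print(f"{label}: {moment:%Y-%m-%d %H:%M}")
```

Generating the timestamps from one input keeps the email, claim window and expiry consistent if the session is rescheduled.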

72h

Participant Pre-Check (72 hrs out)

GitHub account; Codespace opens from workshop link
AI extension authenticated; "hello world" generation passes
API key claimed; curl snippet returns valid response
Zoom desktop client installed (browser breaks breakout screen share)
Hard gate — not optional prep

🏠

Breakout Room Setup

Pre-assign groups — never random (experts resent the kindergarten feel)
2–3 devs per room; 4–5 maximum before collaboration degrades
Mix skill levels; moderate cognitive diversity
Helper assigned to each room before session starts
Brief all three roles together 10 min before session
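
Pre-assignment that mixes skill levels can be done with a simple snake draft. The sketch below assumes each participant has a self-reported skill score (a hypothetical input, not something the playbook defines) and caps rooms at three people.

```python
import math


def assign_rooms(participants, max_room_size=3):
    """Pre-assign rooms of at most `max_room_size`, mixing skill levels.

    `participants` is a list of (name, skill) pairs; skill is any sortable score.
    """
    if not participants:
        return []
    room_count = math.ceil(len(participants) / max_room_size)
    rooms = [[] for _ in range(room_count)]
    ranked = sorted(participants, key=lambda p: p[1], reverse=True)
    for index, (name, _skill) in enumerate(ranked):
        lap, position = divmod(index, room_count)
        # Reverse direction on every other lap (snake draft) to balance rooms.
        room = position if lap % 2 == 0 else room_count - 1 - position
        rooms[room].append(name)
    return rooms


people = [("Ana", 5), ("Ben", 4), ("Cy", 3), ("Dee", 2), ("Eli", 1), ("Fay", 0)]
print(assign_rooms(people))
# [['Ana', 'Dee', 'Eli'], ['Ben', 'Cy', 'Fay']]
```

The output is a starting point; facilitators would still adjust by hand for known pairings, sceptic/enthusiast mix and helper availability.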

frameworks — teach alongside exercises

TDAID — Test-Driven AI Development

Extends classic red-green-refactor with a Plan phase before Red (AI generates implementation roadmap) and a Validate phase after Refactor (human reviews the diff to catch "cheat" tests).

Plan → Red → Green → Refactor → Validate

EXACT — Example-guided AI-Collaborative TDD

Runs Example Mapping before the first test. Offers three autonomy levels; let participants choose one and debrief the difference:

Level	AI runs until	Mode
A	end of feature	speed mode
B	end of each red-green-refactor (RGR) cycle	★ default
C	end of each phase	max oversight
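
The three levels differ only in where the AI hands control back for human review. The sketch below lists those review points, assuming a feature is built in a known number of red-green-refactor cycles; the function name and the (cycle, phase) representation are hypothetical.

```python
PHASES = ("Red", "Green", "Refactor")


def review_points(level, cycles):
    """List the (cycle, phase) points where the AI stops for human review.

    level "A": once, at the end of the feature
    level "B": at the end of each red-green-refactor cycle
    level "C": at the end of every phase
    """
    if cycles < 1:
        raise ValueError("a feature needs at least one cycle")
    if level == "A":
        return [(cycles, PHASES[-1])]
    if level == "B":
        return [(cycle, PHASES[-1]) for cycle in range(1, cycles + 1)]
    if level == "C":
        return [
            (cycle, phase)
            for cycle in range(1, cycles + 1)
            for phase in PHASES
        ]
    raise ValueError(f"unknown autonomy level: {level!r}")


for level in "ABC":
    print(level, len(review_points(level, cycles=3)))
```

For a three-cycle feature this gives 1, 3 and 9 review points respectively, a quick way to show participants what each level costs in interruptions.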

failure mode pre-mortem — 10 predictable failures

Failure mode	Impact	Prevention
⚙ Environment setup in live session	Loses 20–30 min; derails all subsequent timings	Codespaces prebuild + mandatory pre-check 24 hrs before ^[9]
🔑 AI API key failure on day	Blocks all exercises; kills workshop credibility	Pre-provision with expiry; day-before `curl` test required to claim key ^[10]
📺 Demo-heavy, hands-on-light	Expert disengagement within 15 min	Hard rule: ≤7 min explanation before participants touch code; 60–70% of session must be hands-on ^[14]
🗣 Dominant expert hijacking discussion	Others disengage; session follows one rabbit hole	Parking lot + timebox; round-robin debrief format; silent brainstorm before open floor ^[13]
❓ Exercise too ambiguous	Participants stuck; helpers overwhelmed; pacing collapses	Test every exercise solo end-to-end before the session; embed "if stuck" hints as code comments in repo stub
🛠 Tool sprawl	Cognitive overload; participants lose their place	One primary tool per task; introduce tools sequentially; avoid simultaneous Zoom + Miro + Slack + IDE ^[14]
👤 No helper in breakout rooms	Stuck participants wait silently; frustration builds	1 helper per room of 4–6, briefed on exercise goals, arrives in room for first 3 min ^[11]
🚧 Expert resistance to AI tooling	Overt scepticism infects room culture	Address AI limits explicitly in context frame; peer-champion framing; concrete first win in < 5 min ^[12]
⏰ Overrun debrief, no synthesis time	Participants leave with open loops	Hard 10-min closing slot in run-of-show; parking lot absorbs overflow; written recap within 24 hrs
☕ No break in 90-min session	Focus degrades in last 30 min; diminishing returns	5-min break at 00:55, non-negotiable even under time pressure

expedition sub-pages

survey · 22 citations

AI-assisted TDD techniques & evidence (2026)

Productivity paradox, prompt patterns, test quality limits, agentic approaches.


survey · 22 citations

Workshop exercises & scaffolding

Curated katas, scaffold patterns, TDAID and EXACT frameworks in detail.


survey · 15 citations

Facilitation, logistics & failure modes

Codespaces setup, API key provisioning, three facilitation roles, full failure-mode table.


survey · 20 citations

Tooling stack for the workshop

AI assistant comparison (Copilot/Cursor/Cline), test frameworks, devcontainer config, copy-paste TDD prompts.
