Workshop Exercises & Scaffolding for AI-Assisted TDD

TL;DR — Run 2–3 exercises, not ten. Expert developers disengage from trivially simple katas; start with a warm-up that calibrates the room (15 min), then move immediately to a realistic problem. The dominant scaffolding pattern: failing tests pre-written, participants implement with AI — this enforces TDD discipline without debating whether to write tests first. Use a Dev Container to eliminate environment setup as a time sink. [1] [2]

Technical Scaffolding

The biggest practical decision is eliminating environment friction before any coding starts. Centric Consulting’s 10-lab workshop ⭐ 6 [5] solves this with a Dev Container participants open in one click — no local toolchain required. Key components of a production-ready workshop scaffold:

Dev Container / GitHub Codespaces — one config, everyone runs the same runtime + AI extension. Centric offers a choice of .NET, Spring Boot, or bilingual containers [5].
.github/copilot-instructions.md or .cursor/rules — encode TDD rules (“write the failing test first; never implement without a red test”) so the AI tool itself enforces the discipline [18].
Skeleton repo with failing tests pre-written — participants clone a repo where tests exist but implementation stubs are empty, then make them pass with their AI tool [2].
Reference solution on a separate branch — unblocks stuck participants without spoiling the exercise for everyone else [5].
prompts/ folder — participants log each AI prompt chronologically as they work, making debrief discussions on prompt quality concrete rather than hypothetical [4].
CI on every push — instant green/red signal via GitHub Actions; removes the facilitator as a validation bottleneck [2].

Caltech’s one-day format [6] adds reusable checklists and an “LLM-in-the-loop” operating pattern that participants take home. Provide both a reference implementation branch and a take-home checklist; the checklist is what sticks.

Exercise Catalog

For a 90-minute or half-day session, pick one warm-up and one main exercise. The table below spans the full range:

Exercise	Type	Duration	Difficulty	Key learning
String Calculator	Greenfield TDD	15–20 min	Warm-up	Incremental test growth; calibration
Tetris skeleton	Greenfield AI TDD	45–60 min	Medium	Failing-tests-first with AI
Goose Game	Greenfield AI TDD	45–60 min	Medium	Prompt engineering + RGR loop
Gilded Rose	Legacy refactoring	40–60 min	Medium-High	Characterization tests with AI
Trip Service	Dependency breaking	45–60 min	High	Seam isolation before adding tests
EXACT mini-project	Full EXACT workflow	60–90 min	Expert	Example Mapping → AI-TDD synthesis

String Calculator [16]: Roy Osherove’s canonical warm-up. Starts with add("") == 0, incrementally adds comma delimiters, newlines, custom delimiters, and negatives. For an expert audience, run it twice: once without AI (baseline), once with AI (measure delta). The comparison lands more convincingly than any slide. See also garora/TDD-Katas ⭐ 735 [22] for a broader multi-language kata collection.
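For facilitators who want a reference point for the warm-up, here is a minimal sketch of where the kata typically ends up after the steps listed above. It assumes the common form of the kata in which a custom delimiter is declared with a "//<delimiter>" header line and negatives raise an error; the choice of ValueError and the exact message are assumptions for this sketch.

```python
import re


def add(numbers: str) -> int:
    """Sum the integers in `numbers` following the String Calculator kata steps."""
    if numbers == "":
        return 0
    delimiters = [",", "\n"]
    body = numbers
    # An optional first line of the form "//<delimiter>" declares a custom delimiter.
    if numbers.startswith("//"):
        header, body = numbers.split("\n", 1)
        delimiters.append(header[2:])
    pattern = "|".join(re.escape(d) for d in delimiters)
    values = [int(token) for token in re.split(pattern, body)]
    negatives = [v for v in values if v < 0]
    if negatives:
        # Assumed behavior: reject negatives and report all of them at once.
        raise ValueError(f"negatives not allowed: {negatives}")
    return sum(values)
```

In a session, each branch would be driven out by its own test in order: add("") returning 0, then "1,2", then "1\n2,3", then "//;\n1;2", then a negative input. Running the sequence once by hand and once with AI gives the baseline-versus-delta comparison described above.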

Skeleton katas [2] [4]: the Tetris and Goose Game katas both ship with the full test suite pre-written; participants use their AI tool to make the tests pass. Eficode’s Tetris ⭐ 0 phases its tests across board initialization, movement mechanics, line clearing, and scoring. Goose Game ⭐ 3 (Kotlin) adds a prompts/ log that becomes debrief material on where the AI needed disambiguation [4].

Legacy katas [9]: The Gilded Rose is the entry point. It consists of pure nested conditionals with no external dependencies, so participants focus on writing characterization tests before touching any logic. Trip Service escalates: participants must break HTTP/DB dependencies to get the code under test, a problem AI cannot reliably solve without human seam design. Bourgau’s dojo progression [8] maps the full 4-stage path from FizzBuzz to Ugly Trivia if you want to run a multi-session track.
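To make the characterization-test idea concrete, the sketch below records the observed output of a legacy function over a grid of inputs and then checks a refactored version against that recording. The function legacy_update is a deliberately tiny stand-in written for this example, not the real Gilded Rose code, and snapshot, GOLDEN_MASTER, and refactored_update are hypothetical names.

```python
from itertools import product


def legacy_update(name: str, sell_in: int, quality: int) -> tuple[int, int]:
    """Stand-in for untested legacy logic (much smaller than the real kata)."""
    if name != "Aged Brie":
        if quality > 0:
            quality -= 1
    else:
        if quality < 50:
            quality += 1
    sell_in -= 1
    if sell_in < 0:
        if name != "Aged Brie":
            if quality > 0:
                quality -= 1
        else:
            if quality < 50:
                quality += 1
    return sell_in, quality


def snapshot(update) -> dict:
    """Record the output of `update` for every combination of sample inputs."""
    names = ["Aged Brie", "Elixir"]
    sell_ins = [-1, 0, 1]
    qualities = [0, 1, 49, 50]
    return {
        (n, s, q): update(n, s, q) for n, s, q in product(names, sell_ins, qualities)
    }


# Captured from the legacy code before any change, then treated as the expected behavior.
GOLDEN_MASTER = snapshot(legacy_update)


def refactored_update(name: str, sell_in: int, quality: int) -> tuple[int, int]:
    """Candidate refactoring; safe only if it reproduces the recorded behavior."""
    step = 1 if name == "Aged Brie" else -1
    sell_in -= 1
    if sell_in < 0:
        step *= 2
    return sell_in, max(0, min(50, quality + step))
```

The check itself is a single comparison, snapshot(refactored_update) == GOLDEN_MASTER. The test asserts nothing about what the code should do, only that behavior has not changed, which is why it can be written before anyone understands the conditionals.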

Frameworks: TDAID and EXACT

Two frameworks add structure that’s worth teaching explicitly alongside the exercises.

TDAID (Test-Driven AI Development) [3] extends the classic loop with a Plan phase before Red and a Validate phase after Refactor. Plan: AI generates a structured implementation roadmap. Validate: human reviews the diff to ensure the agent didn’t “cheat” by writing tests that confirm broken behavior [1] [21]. Exercise shape for TDAID: write the plan as a comment block, let AI drive Red → Green → Refactor, then human-review the git diff before moving to the next increment.
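One way to support the Validate step is a small helper that points the human reviewer at test files the agent touched during phases where tests are expected to stay frozen. This is a hypothetical aid, not part of TDAID as described in [3]; the phase names, the tests/ directory, and the test_ filename prefix are all assumptions about the workshop repo.

```python
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """Assumed convention: tests live under a tests/ directory or are named test_*."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def flag_for_review(changed_paths: list[str], phase: str) -> list[str]:
    """Return test files changed during a phase where tests should not change."""
    if phase not in {"green", "refactor"}:
        # During Red (or planning), editing tests is the whole point.
        return []
    return sorted(path for path in changed_paths if is_test_file(path))
```

The list of changed paths could come from the output of git diff --name-only for the increment. A non-empty result does not prove the agent cheated; it only tells the reviewer which parts of the diff to read first.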

EXACT Coding [12] (Example-guided AI-Collaborative Test-driven Coding) adds an Example Mapping session before the first test: the team clarifies the story, rules, examples, and open questions in a short structured conversation. Three autonomy levels let participants choose their control posture:

Level	AI runs until…	Recommended for
A	End of feature	Experienced AI users, speed
B	End of each RGR cycle	Default; balanced control
C	End of each phase	Learning mode; max oversight

Level B is the default for workshop use: its checkpoints are frequent enough to keep participants engaged, yet coarse enough that AI assistance feels meaningful [12]. The GitHub Copilot Workshop [15] structures a similar three-path progression (IDE features → pro/agents → CLI/SDK) that maps well onto beginner-to-expert cohorts.

For AI-specific techniques inside an exercise, Automattic’s pattern [11] is worth demonstrating: after writing one test, ask the AI to “triangulate examples” — it generates additional edge-case assertions from existing code structure, eliminating manual boilerplate. Similarly, GitHub’s /tests slash command [10] lets participants describe requirements in natural language and get AI-generated test scaffolding back in one step.
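The triangulation step might look like the sketch below: one hand-written test starts the cycle, and a table of additional cases is the kind of output an assistant could propose for human review. The leap-year function is a stand-in chosen for brevity and does not come from the cited sources; the extra cases here were written by hand to illustrate the shape, not generated by a tool.

```python
import unittest


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


class LeapYearTest(unittest.TestCase):
    def test_year_divisible_by_four_is_leap(self) -> None:
        # The single hand-written test that starts the cycle.
        self.assertTrue(is_leap_year(2024))

    def test_triangulated_examples(self) -> None:
        # Additional edge cases of the kind an assistant might suggest;
        # each row should be reviewed by a human before it is accepted.
        cases = [(2023, False), (1900, False), (2000, True), (2100, False)]
        for year, expected in cases:
            with self.subTest(year=year):
                self.assertEqual(is_leap_year(year), expected)
```

Keeping the triangulated cases in one table makes them quick to review in debrief: participants can mark which rows the AI proposed correctly and which it got wrong.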

Anti-pattern Demo: Vibe Coding vs. TDD

Reserve 10–15 minutes for a live demonstration of the failure mode. Start a feature without tests, use AI to “just implement it,” add a second feature, observe the architecture degrade. Without tests, AI coding agents never spontaneously suggest refactoring, producing monolithic, tightly coupled code where each new feature takes longer than the last [20]. The fix isn’t discipline — it’s tests: they enforce interface stability, make hallucinated code fail immediately, and prevent the “refactoring avoidance” anti-pattern [14]. Show the vibe-coded diff alongside the TDD diff; expert developers will internalize the point without further argument.

Planning-first also helps: creating a mini-PRD or SPEC.md before prompting [13] shifts AI from free-wheeling code generator to constrained implementation engine — a pattern that pairs naturally with EXACT’s Example Mapping step.

Expert Audience Considerations

Skip TDD theory. Expert participants already know what red-green-refactor is. Spend that time on what changes with AI in the loop: the Validate phase, autonomy levels, and prompt engineering.
Use real-world complexity. Codely’s approach [7] of working on existing codebases (not greenfield toys) is more relevant and more engaging for senior developers.
Pair strategically. Pair architects (who own test strategy and system design) with devs who drive agent prompts and the RGR cycle [19]. Knowledge transfer surfaces naturally without making it the explicit goal.
Leave autonomy open. Let participants choose their AI tool’s EXACT autonomy level rather than mandating one [12]. Comparing Level A vs Level C choices in debrief is itself a rich discussion.
Debrief the prompts, not the code. The most valuable expert discussion is about prompt quality: what context the AI needed, where it hallucinated, where it outperformed. The prompts/ log pattern [4] makes this concrete.

For open-source curricula to adapt, see GitHub Copilot Bootcamp [17] (4-week open curriculum with labs and weekly reflections) and Caltech CTME’s format [6] (6–8 hour intensive with enterprise customization).