TL;DR — Run 2–3 exercises, not ten. Expert developers disengage from trivially simple katas; start with a warm-up that calibrates the room (15 min), then move immediately to a realistic problem. The dominant scaffolding pattern: failing tests pre-written, participants implement with AI — this enforces TDD discipline without debating whether to write tests first. Use a Dev Container to eliminate environment setup as a time sink. [1] [2]
Technical Scaffolding
The biggest practical decision is eliminating environment friction before any coding starts. Centric Consulting’s 10-lab workshop ⭐ 6 [5] solves this with a Dev Container participants open in one click — no local toolchain required. Key components of a production-ready workshop scaffold:
- Dev Container / GitHub Codespaces — one config, everyone runs the same runtime + AI extension. Centric offers a choice of .NET, Spring Boot, or bilingual containers [5].
- .github/copilot-instructions.md or .cursor/rules — encode TDD rules (“write the failing test first; never implement without a red test”) so the AI tool itself enforces the discipline [18].
- Skeleton repo with failing tests pre-written — participants clone a repo where tests exist but implementation stubs are empty, then make them pass with their AI tool [2].
- Reference solution on a separate branch — unblocks stuck participants without spoiling the exercise for everyone else [5].
- prompts/ folder — participants log each AI prompt chronologically as they work, making debrief discussions on prompt quality concrete rather than hypothetical [4].
- CI on every push — instant green/red signal via GitHub Actions; removes the facilitator as a validation bottleneck [2].
Caltech’s one-day format [6] adds reusable checklists and an “LLM-in-the-loop” operating pattern that participants take home. Provide both a reference-implementation branch and a take-home checklist; the checklist is what sticks.
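The prompts/ log from the scaffold list can be kept by hand, but a small helper lowers the friction. The following is a minimal sketch, not something the cited repos ship: the function name `log_prompt`, the numbered-file layout, and the timestamp comment are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_prompt(prompt: str, prompts_dir: Path = Path("prompts")) -> Path:
    """Write one prompt to its own numbered file and return the file path.

    Hypothetical helper: the naming scheme (001.md, 002.md, ...) is an
    assumption, chosen so that entries sort in the order they were written.
    """
    prompts_dir.mkdir(parents=True, exist_ok=True)
    # Number the new entry by counting the entries that already exist.
    sequence = len(list(prompts_dir.glob("*.md"))) + 1
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = prompts_dir / f"{sequence:03d}.md"
    entry.write_text(f"<!-- {timestamp} -->\n{prompt.strip()}\n", encoding="utf-8")
    return entry
```

Calling `log_prompt("Make the first failing test pass without editing the tests")` before each AI interaction leaves a chronological trail that can be committed with the code and read back during the debrief.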
Exercise Catalog
For a 90-minute or half-day session, pick one warm-up + one main exercise. The table below spans the full range:
| Exercise | Type | Duration | Difficulty | Key learning |
|---|---|---|---|---|
| String Calculator | Greenfield TDD | 15–20 min | Warm-up | Incremental test growth; calibration |
| Tetris skeleton | Greenfield AI TDD | 45–60 min | Medium | Failing-tests-first with AI |
| Goose Game | Greenfield AI TDD | 45–60 min | Medium | Prompt engineering + RGR loop |
| Gilded Rose | Legacy refactoring | 40–60 min | Medium-High | Characterization tests with AI |
| Trip Service | Dependency breaking | 45–60 min | High | Seam isolation before adding tests |
| EXACT mini-project | Full EXACT workflow | 60–90 min | Expert | Example Mapping → AI-TDD synthesis |
String Calculator [16]: Roy Osherove’s canonical warm-up. Starts with add("") == 0, incrementally adds comma delimiters, newlines, custom delimiters, and negatives. For an expert audience, run it twice: once without AI (baseline), once with AI (measure delta). The comparison lands more convincingly than any slide. See also garora/TDD-Katas ⭐ 735 [22] for a broader multi-language kata collection.
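For facilitators who want a reference end state for the warm-up, here is a minimal Python sketch covering the steps listed above. The `//<delimiter>` header line and the "negatives not allowed" message follow the commonly circulated version of the kata; the function signature and the choice of `ValueError` are assumptions of this sketch.

```python
import re


def add(numbers: str) -> int:
    """Sum the integers in a delimited string; an empty string sums to 0."""
    if numbers == "":
        return 0
    delimiters = [",", "\n"]
    # An optional first line of the form "//<delimiter>" declares one custom delimiter.
    if numbers.startswith("//"):
        header, numbers = numbers.split("\n", 1)
        delimiters.append(header[2:])
    pattern = "|".join(re.escape(d) for d in delimiters)
    values = [int(token) for token in re.split(pattern, numbers)]
    negatives = [v for v in values if v < 0]
    if negatives:
        # Report every negative at once rather than stopping at the first.
        raise ValueError(f"negatives not allowed: {negatives}")
    return sum(values)
```

In the workshop itself, participants would arrive at something like this one failing test at a time; handing out the finished function defeats the purpose, so it belongs on the reference-solution branch.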
Skeleton katas [2] [4]: both repos ship with the full test suite pre-written; participants use their AI tool to make the tests pass. Eficode’s Tetris ⭐ 0 phases its tests across board initialization, movement mechanics, line clearing, and scoring. Goose Game ⭐ 3 (Kotlin) adds a prompts/ log that becomes debrief material on where the AI needed disambiguation [4].
Legacy katas [9]: The Gilded Rose is the entry point — pure nested conditionals, no external dependencies, so participants focus on writing characterization tests before touching any logic. Trip Service escalates: participants must break HTTP/DB dependencies to get code under test, a problem AI alone cannot reliably solve without human seam design. Bourgau’s dojo progression [8] maps the full 4-stage path from FizzBuzz to Ugly Trivia if you want to run a multi-session track.
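A characterization test pins what the code does today, whether or not that behavior is correct. The sketch below uses a short stand-in function with the same nested-conditional flavor; it is not the actual Gilded Rose source, and the item name "Elixir" and the helper `check_pinned_behavior` are assumptions for illustration.

```python
def legacy_update(name, sell_in, quality):
    """Stand-in for tangled legacy logic; NOT the actual Gilded Rose source."""
    if name != "Aged Brie":
        if quality > 0:
            quality = quality - 1
    else:
        if quality < 50:
            quality = quality + 1
    sell_in = sell_in - 1
    if sell_in < 0:
        if name != "Aged Brie":
            if quality > 0:
                quality = quality - 1
        else:
            if quality < 50:
                quality = quality + 1
    return sell_in, quality


# Expected values were captured by running the code above, not derived from a spec.
PINNED = {
    ("Elixir", 5, 10): (4, 9),
    ("Elixir", 0, 10): (-1, 8),
    ("Elixir", 0, 1): (-1, 0),
    ("Aged Brie", 5, 10): (4, 11),
    ("Aged Brie", 0, 49): (-1, 50),
}


def check_pinned_behavior(update=legacy_update):
    """Fail loudly if a refactored `update` drifts from the recorded behavior."""
    for args, expected in PINNED.items():
        actual = update(*args)
        assert actual == expected, f"{args}: expected {expected}, got {actual}"
```

An AI assistant is well suited to generating the input combinations, but the expected values should come from executing the legacy code, not from the assistant's guess at what the code ought to do.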
Frameworks: TDAID and EXACT
Two frameworks add structure that’s worth teaching explicitly alongside the exercises.
TDAID (Test-Driven AI Development) [3] extends the classic loop with a Plan phase before Red and a Validate phase after Refactor. Plan: AI generates a structured implementation roadmap. Validate: human reviews the diff to ensure the agent didn’t “cheat” by writing tests that confirm broken behavior [1] [21]. Exercise shape for TDAID: write the plan as a comment block, let AI drive Red → Green → Refactor, then human-review the git diff before moving to the next increment.
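The Validate phase is a human review, but a small script can point the reviewer at the riskiest changes first. The following is a hypothetical aid, not part of TDAID as described in [3]: it assumes tests live under a tests/ directory or in files named test_*.py, and that the list of changed paths comes from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """Heuristic (assumed convention): tests live under tests/ or are named test_*.py."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def flag_for_review(changed_paths: list[str], phase: str) -> list[str]:
    """Return test files touched in a phase where tests are expected to stay frozen.

    Tests legitimately change during Red. A test edited during Green or
    Refactor may be fine, but it is where an agent could have weakened an
    assertion to get to green, so a human should read that part of the diff.
    """
    if phase not in {"green", "refactor"}:
        return []
    return [path for path in changed_paths if is_test_file(path)]
```

An empty result does not mean the increment is safe; it only means the tests were not edited, so the reviewer still reads the implementation diff.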
EXACT Coding [12] (Example-guided AI-Collaborative Test-driven Coding) prepends an Example Mapping session to the first test: the team clarifies the story, rules, examples, and open questions in a short structured conversation. Three autonomy levels let participants choose their control posture:
| Level | AI runs until… | Recommended for |
|---|---|---|
| A | End of feature | Experienced AI users, speed |
| B | End of each RGR cycle | Default; balanced control |
| C | End of each phase | Learning mode; max oversight |
Level B is the default for workshop use — frequent enough to stay engaged, coarse enough that AI assistance feels meaningful [12]. The GitHub Copilot Workshop [15] structures a similar three-path progression (IDE features → pro/agents → CLI/SDK) that maps well onto beginner-to-expert cohorts.
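One way to make the three levels concrete for participants is to model them as checkpoints at which the AI hands control back. This is a sketch of the table above, not code from EXACT [12]; the names `Checkpoint`, `PAUSE_AT`, and `should_pause` are hypothetical, and it assumes that a coarser checkpoint (end of feature) also counts as reaching the finer ones.

```python
from enum import IntEnum


class Checkpoint(IntEnum):
    """Points where the AI could stop, ordered from finest to coarsest."""

    PHASE_END = 1    # after Red, after Green, or after Refactor
    CYCLE_END = 2    # after one full Red-Green-Refactor cycle
    FEATURE_END = 3  # after the whole feature is done


# Mapping taken from the autonomy-level table: the level fixes the finest
# checkpoint at which the human is brought back in.
PAUSE_AT = {
    "A": Checkpoint.FEATURE_END,
    "B": Checkpoint.CYCLE_END,
    "C": Checkpoint.PHASE_END,
}


def should_pause(level: str, reached: Checkpoint) -> bool:
    """Return True when the AI should stop and wait for human review."""
    return reached >= PAUSE_AT[level]
```

Under this model a Level C participant reviews after every phase, while a Level A participant is interrupted only once per feature, which is the contrast worth drawing out in the debrief.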
For AI-specific techniques inside an exercise, Automattic’s pattern [11] is worth demonstrating: after writing one test, ask the AI to “triangulate examples” — it generates additional edge-case assertions from existing code structure, eliminating manual boilerplate. Similarly, GitHub’s /tests slash command [10] lets participants describe requirements in natural language and get AI-generated test scaffolding back in one step.
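To show what triangulation looks like on the page, the sketch below starts from a single hand-written test and adds a table of further cases of the kind an assistant might propose. The leap-year function and every case here were written by hand for illustration; they are not output from Automattic's workflow [11] or from any AI tool.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def test_typical_leap_year():
    # The one example the participant writes by hand.
    assert is_leap_year(2024)


# Illustrative triangulated cases: each one rules out a simpler, wrong
# implementation (for example, "return year % 4 == 0" fails on 1900).
TRIANGULATED_CASES = [
    (2023, False),  # not divisible by 4
    (1900, False),  # century not divisible by 400
    (2000, True),   # century divisible by 400
    (2100, False),  # another non-leap century
]


def test_triangulated_cases():
    for year, expected in TRIANGULATED_CASES:
        assert is_leap_year(year) is expected, year
```

The useful review question for participants is whether each proposed case constrains the implementation further or merely repeats one that is already covered.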
Anti-pattern Demo: Vibe Coding vs. TDD
Reserve 10–15 minutes for a live demonstration of the failure mode. Start a feature without tests, use AI to “just implement it,” add a second feature, observe the architecture degrade. Without tests, AI coding agents never spontaneously suggest refactoring, producing monolithic, tightly coupled code where each new feature takes longer than the last [20]. The fix isn’t discipline — it’s tests: they enforce interface stability, make hallucinated code fail immediately, and prevent the “refactoring avoidance” anti-pattern [14]. Show the vibe-coded diff alongside the TDD diff; expert developers will internalize the point without further argument.
Planning-first also helps: creating a mini-PRD or SPEC.md before prompting [13] shifts AI from free-wheeling code generator to constrained implementation engine — a pattern that pairs naturally with EXACT’s Example Mapping step.
Expert Audience Considerations
- Skip TDD theory. They know what red-green-refactor is. Spend that time on what changes with AI in the loop: the Validate phase, autonomy levels, prompt engineering.
- Use real-world complexity. Codely’s approach [7] of working on existing codebases (not greenfield toys) is more relevant and more engaging for senior developers.
- Pair strategically. Pair architects (who own test strategy and system design) with devs who drive agent prompts and the RGR cycle [19]. Knowledge transfer surfaces naturally without making it the explicit goal.
- Leave autonomy open. Let participants choose their AI tool’s EXACT autonomy level rather than mandating one [12]. Comparing Level A vs Level C choices in debrief is itself a rich discussion.
- Debrief the prompts, not the code. The most valuable expert discussion is about prompt quality: what context the AI needed, where it hallucinated, where it outperformed. The prompts/ log pattern [4] makes this concrete.
For open-source curricula to adapt, see GitHub Copilot Bootcamp [17] (4-week open curriculum with labs and weekly reflections) and Caltech CTME’s format [6] (6–8 hour intensive with enterprise customization).