Atlas survey

Workshop Exercises & Scaffolding for AI-Assisted TDD

Curated exercises, katas, and scaffolding patterns for a hands-on AI-assisted TDD mini-workshop targeting expert developers.


TL;DR: Run 2–3 exercises, not ten. Expert developers disengage from trivially simple katas, so start with a 15-minute warm-up that calibrates the room, then move immediately to a realistic problem. The dominant scaffolding pattern is to pre-write the failing tests and have participants implement with AI; this enforces TDD discipline without a debate over whether to write tests first. Use a Dev Container so that environment setup does not eat session time. [1] [2]

Technical Scaffolding

The biggest practical decision is to eliminate environment friction before any coding starts. Centric Consulting’s 10-lab workshop ⭐ 6 [5] solves this with a Dev Container that participants open in one click, with no local toolchain required. Key components of a production-ready workshop scaffold:

  • Dev Container / GitHub Codespaces — one config, everyone runs the same runtime + AI extension. Centric offers a choice of .NET, Spring Boot, or bilingual containers [5].
  • .github/copilot-instructions.md or .cursor/rules — encode TDD rules (“write the failing test first; never implement without a red test”) so the AI tool itself enforces the discipline [18].
  • Skeleton repo with failing tests pre-written — participants clone a repo where tests exist but implementation stubs are empty, then make them pass with their AI tool [2].
  • Reference solution on a separate branch — unblocks stuck participants without spoiling the exercise for everyone else [5].
  • prompts/ folder — participants log each AI prompt chronologically as they work, making debrief discussions on prompt quality concrete rather than hypothetical [4].
  • CI on every push — instant green/red signal via GitHub Actions; removes the facilitator as a validation bottleneck [2].

Caltech’s one-day format [6] adds reusable checklists and an “LLM-in-the-loop” operating pattern that participants take home. Provide both a reference implementation branch and a take-home checklist; the checklist is what sticks.

Exercise Catalog

For a 90-minute or half-day session, pick one warm-up + one main exercise. The table below spans the full range:

Exercise | Type | Duration | Difficulty | Key learning
String Calculator | Greenfield TDD | 15–20 min | Warm-up | Incremental test growth; calibration
Tetris skeleton | Greenfield AI TDD | 45–60 min | Medium | Failing-tests-first with AI
Goose Game | Greenfield AI TDD | 45–60 min | Medium | Prompt engineering + RGR loop
Gilded Rose | Legacy refactoring | 40–60 min | Medium-High | Characterization tests with AI
Trip Service | Dependency breaking | 45–60 min | High | Seam isolation before adding tests
EXACT mini-project | Full EXACT workflow | 60–90 min | Expert | Example Mapping → AI-TDD synthesis

String Calculator [16]: Roy Osherove’s canonical warm-up. It starts with add("") == 0, then incrementally adds comma delimiters, newlines, custom delimiters, and handling of negative numbers. For an expert audience, run it twice: once without AI (baseline) and once with AI (to measure the difference). The comparison lands more convincingly than any slide. See also garora/TDD-Katas ⭐ 735 [22] for a broader multi-language kata collection.
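For facilitators who want a reference point for the warm-up, here is one possible end state of the kata in Python. It is a sketch, not Osherove's reference solution: the `//<delimiter>` header form and the choice of `ValueError` for negatives are assumptions about how a given group might interpret the steps.

```python
import re


def add(numbers: str) -> int:
    """String Calculator sketch: sum the integers in a delimited string."""
    if numbers == "":
        return 0
    delimiters = [",", "\n"]
    body = numbers
    if numbers.startswith("//"):
        # Assumed custom-delimiter header, e.g. "//;\n1;2".
        header, body = numbers.split("\n", 1)
        delimiters.append(header[2:])
    pattern = "|".join(re.escape(d) for d in delimiters)
    values = [int(token) for token in re.split(pattern, body)]
    negatives = [v for v in values if v < 0]
    if negatives:
        # Rejecting negatives is the last increment in this sketch.
        raise ValueError(f"negatives not allowed: {negatives}")
    return sum(values)


# The examples below are listed in the order a participant would add them as tests.
INCREMENTS = [
    ("", 0),
    ("1", 1),
    ("1,2", 3),
    ("1\n2,3", 6),
    ("//;\n1;2", 3),
]
```

Each row of `INCREMENTS` corresponds to one red-green step, which makes it easy to compare how many steps the no-AI and with-AI runs completed in the same time box.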

Skeleton katas [2] [4]: Eficode’s Tetris ⭐ 0 and Goose Game ⭐ 3 (Kotlin) both pre-write the full test suite; participants use their AI tool to make the tests pass. Tetris phases its tests across board initialization, movement mechanics, line clearing, and scoring. Goose Game adds a prompts/ log that becomes debrief material on where the AI needed disambiguation [4].

Legacy katas [9]: The Gilded Rose is the entry point — pure nested conditionals, no external dependencies, so participants focus on writing characterization tests before touching any logic. Trip Service escalates: participants must break HTTP/DB dependencies to get code under test, a problem AI alone cannot reliably solve without human seam design. Bourgau’s dojo progression [8] maps the full 4-stage path from FizzBuzz to Ugly Trivia if you want to run a multi-session track.
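A characterization test pins down what legacy code currently does before anyone changes it. The sketch below shows the idea with a golden-master comparison. `legacy_update` is a small hypothetical stand-in written for this example, not the actual Gilded Rose code, and the input grid is an arbitrary choice.

```python
import itertools


def legacy_update(name: str, sell_in: int, quality: int) -> tuple[int, int]:
    """Hypothetical stand-in for tangled legacy logic (not the real kata code)."""
    if name != "Aged Brie":
        if quality > 0:
            quality = quality - 1
    else:
        if quality < 50:
            quality = quality + 1
    sell_in = sell_in - 1
    if sell_in < 0:
        if name != "Aged Brie":
            if quality > 0:
                quality = quality - 1
        else:
            if quality < 50:
                quality = quality + 1
    return sell_in, quality


def record_behavior(update) -> dict:
    """Capture current outputs over a grid of inputs (the golden master)."""
    names = ["Aged Brie", "Elixir"]
    sell_ins = [-1, 0, 1, 10]
    qualities = [0, 1, 49, 50]
    return {
        args: update(*args)
        for args in itertools.product(names, sell_ins, qualities)
    }


# Recorded once, before any refactoring starts.
GOLDEN_MASTER = record_behavior(legacy_update)


def refactored_update(name: str, sell_in: int, quality: int) -> tuple[int, int]:
    """Candidate refactoring; it must reproduce the recorded behavior exactly."""
    sell_in -= 1
    steps = 2 if sell_in < 0 else 1
    for _ in range(steps):
        if name == "Aged Brie":
            if quality < 50:
                quality += 1
        elif quality > 0:
            quality -= 1
    return sell_in, quality
```

The check participants run after each refactoring step is `record_behavior(refactored_update) == GOLDEN_MASTER`. In a workshop, a useful prompt exercise is asking the AI to propose the input grid and then reviewing which boundaries it missed.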

Frameworks: TDAID and EXACT

Two frameworks add structure that’s worth teaching explicitly alongside the exercises.

TDAID (Test-Driven AI Development) [3] extends the classic loop with a Plan phase before Red and a Validate phase after Refactor. Plan: AI generates a structured implementation roadmap. Validate: human reviews the diff to ensure the agent didn’t “cheat” by writing tests that confirm broken behavior [1] [21]. Exercise shape for TDAID: write the plan as a comment block, let AI drive Red → Green → Refactor, then human-review the git diff before moving to the next increment.
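One way to support the Validate review is to surface test files that changed during a phase where only production code was expected to change. The helper below is an assumption of this write-up, not part of TDAID as described in [3]: the function names, the phase labels, and the path heuristics are all hypothetical. The list of changed paths could come from `git diff --name-only`.

```python
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """Heuristic (assumed naming conventions): does this path look like a test?"""
    p = PurePosixPath(path)
    return (
        "tests" in p.parts
        or p.name.startswith("test_")
        or p.name.endswith("_test.py")
    )


def files_needing_review(changed_paths: list[str], phase: str) -> list[str]:
    """Return changed test files that deserve a close human look.

    In the Green and Refactor phases the tests are expected to stay fixed,
    so an edited test is a hint that the agent may have bent the test
    to fit the code rather than the other way round.
    """
    if phase not in {"green", "refactor"}:
        return []
    return [path for path in changed_paths if is_test_file(path)]
```

For example, `files_needing_review(["src/board.py", "tests/test_board.py"], "green")` returns `["tests/test_board.py"]`. The helper only narrows where to look; the human still reads the diff.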

EXACT Coding [12] (Example-guided AI-Collaborative Test-driven Coding) puts an Example Mapping session before the first test: the team clarifies the story, rules, examples, and open questions in a short structured conversation. Three autonomy levels let participants choose their control posture:

Level | AI runs until… | Recommended for
A | End of feature | Experienced AI users, speed
B | End of each RGR cycle | Default; balanced control
C | End of each phase | Learning mode; max oversight

Level B is the default for workshop use — frequent enough to stay engaged, coarse enough that AI assistance feels meaningful [12]. The GitHub Copilot Workshop [15] structures a similar three-path progression (IDE features → pro/agents → CLI/SDK) that maps well onto beginner-to-expert cohorts.
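The three autonomy levels in the table above amount to a rule for when the human takes back control. A minimal sketch of that rule follows; the enum and function names are hypothetical, and it assumes that a coarser checkpoint (end of feature) always counts as a finer one (end of cycle, end of phase) as well.

```python
from enum import Enum


class Checkpoint(Enum):
    """Points at which the AI could hand control back, finest to coarsest."""

    PHASE_DONE = 1    # one of Red, Green, or Refactor has finished
    CYCLE_DONE = 2    # a full Red-Green-Refactor cycle has finished
    FEATURE_DONE = 3  # the whole feature has finished


class Autonomy(Enum):
    """EXACT autonomy levels, mapped to the finest checkpoint that pauses the AI."""

    A = Checkpoint.FEATURE_DONE
    B = Checkpoint.CYCLE_DONE
    C = Checkpoint.PHASE_DONE


def should_pause(level: Autonomy, checkpoint: Checkpoint) -> bool:
    """True when the human should review before the AI continues."""
    return checkpoint.value >= level.value.value
```

With this mapping, `should_pause(Autonomy.B, Checkpoint.PHASE_DONE)` is `False` and `should_pause(Autonomy.B, Checkpoint.CYCLE_DONE)` is `True`, which matches the Level B row: the AI runs through a whole cycle and stops at its end.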

For AI-specific techniques inside an exercise, Automattic’s pattern [11] is worth demonstrating: after writing one test, ask the AI to “triangulate examples”. It then generates additional edge-case assertions from the existing code structure, which removes manual boilerplate. Similarly, GitHub Copilot’s /tests slash command [10] lets participants describe requirements in natural language and get AI-generated test scaffolding back in one step.
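The result of a triangulation prompt usually looks like one hand-written example plus a table of additional cases. The sketch below illustrates that shape; the `clamp` function and the specific rows are hypothetical and stand in for whatever the exercise is testing, and the rows are the kind an assistant might propose rather than actual tool output.

```python
import unittest


def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical function under test: restrict value to [low, high]."""
    return max(low, min(value, high))


class ClampTest(unittest.TestCase):
    def test_value_inside_range_is_unchanged(self) -> None:
        # The single hand-written starting example.
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_triangulated_examples(self) -> None:
        # Additional rows of the kind an assistant might propose;
        # each one should be reviewed before it is kept.
        cases = [
            (-1, 0, 10, 0),   # below the range
            (11, 0, 10, 10),  # above the range
            (0, 0, 10, 0),    # exactly on the lower bound
            (10, 0, 10, 10),  # exactly on the upper bound
        ]
        for value, low, high, expected in cases:
            with self.subTest(value=value):
                self.assertEqual(clamp(value, low, high), expected)


def run_clamp_tests() -> unittest.TestResult:
    """Run the suite without exiting the interpreter."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClampTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the generated rows in a table makes the debrief concrete: participants can point at the rows they deleted or corrected and explain why.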

Anti-pattern Demo: Vibe Coding vs. TDD

Reserve 10–15 minutes for a live demonstration of the failure mode. Start a feature without tests, use AI to “just implement it,” add a second feature, observe the architecture degrade. Without tests, AI coding agents never spontaneously suggest refactoring, producing monolithic, tightly coupled code where each new feature takes longer than the last [20]. The fix isn’t discipline — it’s tests: they enforce interface stability, make hallucinated code fail immediately, and prevent the “refactoring avoidance” anti-pattern [14]. Show the vibe-coded diff alongside the TDD diff; expert developers will internalize the point without further argument.

Planning-first also helps: creating a mini-PRD or SPEC.md before prompting [13] shifts AI from free-wheeling code generator to constrained implementation engine — a pattern that pairs naturally with EXACT’s Example Mapping step.

Expert Audience Considerations

  • Skip TDD theory. They know what red-green-refactor is. Spend that time on what changes with AI in the loop: the Validate phase, autonomy levels, prompt engineering.
  • Use real-world complexity. Codely’s approach [7] of working on existing codebases (not greenfield toys) is more relevant and more engaging for senior developers.
  • Pair strategically. Pair architects (who own test strategy and system design) with devs who drive agent prompts and the RGR cycle [19]. Knowledge transfer surfaces naturally without making it the explicit goal.
  • Leave autonomy open. Let participants choose their AI tool’s EXACT autonomy level rather than mandating one [12]. Comparing Level A vs Level C choices in debrief is itself a rich discussion.
  • Debrief the prompts, not the code. The most valuable expert discussion is about prompt quality: what context the AI needed, where it hallucinated, where it outperformed. The prompts/ log pattern [4] makes this concrete.

For open-source curricula to adapt, see GitHub Copilot Bootcamp [17] (4-week open curriculum with labs and weekly reflections) and Caltech CTME’s format [6] (6–8 hour intensive with enterprise customization).
