Exercise Catalog

AI-Assisted TDD · Expert Developers · Pick one warm-up + one main exercise per session. All exercises use pre-written failing tests; participants implement with their AI tool. The golden constraint that prevents AI cheating:

You may not modify the test file.

Session formats

90-minute session (recommended)

String Calculator · 15–20 min · warm-up

Tetris or Goose Game · 45–60 min · medium

Reserve 15 min for debrief. Leave autonomy level choice open — comparing Level A vs C in discussion is richer than mandating one.

Half-day session

String Calculator × 2 · 30 min · run without AI first, then with AI

Goose Game · 45–60 min · prompt log debrief

Gilded Rose · 40–60 min · legacy refactoring

Anti-pattern demo · 10–15 min · vibe-coding failure mode

The before/after delta on String Calculator is the sharpest data point in the room. It persuades more than any slide on AI productivity.

Exercises

Difficulty: Warm-up · Medium · Medium-High · High · Expert

Warm-up · Greenfield TDD

String Calculator

⏱ 15–20 min · no repo required

Incremental growth from add("") == 0 through comma delimiters, newlines, custom delimiters, and negatives. Primary use: calibrate the room. Run it twice — once without AI, once with — to make the productivity delta concrete. Expert developers internalize the point without further argument. [16]
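The increments named above can be sketched in a few lines. This is a minimal illustration, not the workshop's reference solution: the `//<delimiter>` header form for custom delimiters and the `ValueError` on negatives are assumptions borrowed from common versions of the kata, since the catalog does not specify them.

```python
import re


def add(numbers: str) -> int:
    """Sum the integers in `numbers`, one kata increment at a time."""
    # Increment 1: the empty string sums to zero.
    if numbers == "":
        return 0

    # Increments 2 and 3: commas and newlines both separate numbers.
    delimiters = [",", "\n"]

    # Increment 4 (assumed syntax): an optional "//<delimiter>" first line.
    if numbers.startswith("//"):
        header, numbers = numbers.split("\n", 1)
        delimiters.append(header[2:])

    pattern = "|".join(re.escape(d) for d in delimiters)
    values = [int(token) for token in re.split(pattern, numbers)]

    # Increment 5 (assumed behaviour): negatives are rejected and listed.
    negatives = [v for v in values if v < 0]
    if negatives:
        raise ValueError(f"negatives not allowed: {negatives}")

    return sum(values)
```

Each comment marks one red-green step, which makes it easy to compare how far participants get in the unassisted run versus the AI-assisted run.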

See also garora/TDD-Katas ⭐ 735 for multi-language kata alternatives. [22]

Prerequisites

none

Medium · Greenfield AI TDD · ⭐ 0

Tetris Skeleton

⏱ 45–60 min · Eficode skeleton repo · github.com

Full test suite pre-written — board init → movement → line clearing → scoring. Participants use Copilot to make failing tests pass. The "you may not modify the test file" constraint prevents AI from gaming assertions. CI runs on every push, removing the facilitator as a validation bottleneck. [2]

Prerequisites

AI tool installed

Medium · Greenfield AI TDD · ⭐ 3

Goose Game

⏱ 45–60 min · Kotlin · xpepper · github.com

Full test suite pre-written. Adds a prompts/ folder where participants log each AI prompt chronologically as they work. This log becomes the primary debrief artifact: what context the AI needed, where it hallucinated, where it outperformed. Expert developers find the prompt-quality discussion more valuable than the code itself. [4]

Prerequisites

AI tool installed · TDD basics

Medium-High · Legacy refactoring

Gilded Rose

⏱ 40–60 min · pure nested conditionals · no external deps

Write characterization tests before touching any logic. No HTTP, no DB — all complexity is cognitive. Participants experience the AI's tendency to "refactor" without tests and observe architectural degradation in real time. Entry point to Bourgau's 4-stage legacy code progression. [9] [8]
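A characterization test records what the legacy code does today and freezes it as a safety net before any refactoring. The sketch below illustrates the idea under loud assumptions: `legacy_update` is a hypothetical stand-in with nested conditionals, not the real Gilded Rose rules, and the sampled input values are arbitrary.

```python
import itertools


def legacy_update(name, sell_in, quality):
    """Hypothetical stand-in for the kata's nested conditionals."""
    if name != "Aged Brie":
        if quality > 0:
            quality -= 1
    else:
        if quality < 50:
            quality += 1
    sell_in -= 1
    if sell_in < 0:
        if name != "Aged Brie":
            if quality > 0:
                quality -= 1
        else:
            if quality < 50:
                quality += 1
    return sell_in, quality


def characterize(fn, names, sell_ins, qualities):
    """Record whatever `fn` currently returns for every input combination."""
    combos = itertools.product(names, sell_ins, qualities)
    return {args: fn(*args) for args in combos}


# Captured once from the untouched legacy code, then treated as frozen.
GOLDEN = characterize(
    legacy_update,
    names=["Aged Brie", "Elixir"],
    sell_ins=[-1, 0, 1, 5],
    qualities=[0, 1, 49, 50],
)


def check_refactoring(candidate):
    """Return the inputs where `candidate` disagrees with the golden record."""
    return [args for args, expected in GOLDEN.items() if candidate(*args) != expected]
```

An empty list from `check_refactoring` means the refactored version still behaves like the original on every sampled input; any entry in the list is a concrete regression to discuss.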

Prerequisites

TDD basics · Legacy code concepts

High · Dependency breaking

Trip Service

⏱ 45–60 min · HTTP + DB seam design

Break HTTP and DB dependencies to get code under test before writing a single assertion. AI alone cannot reliably design seams — this is the exercise where human architectural judgment is the critical input, not prompt quality. The sharpest demonstration that TDD-with-AI is developer-led, not AI-delegated. [9]
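One common way to introduce such a seam is constructor injection: the service receives its collaborators instead of reaching for them directly. The sketch below assumes a simplified domain; `TripService`, `FakeSession`, `FakeTripStore`, and every method name are hypothetical, with the session standing in for the HTTP dependency and the trip store for the DB dependency.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison; avoids recursing through friend lists
class User:
    name: str
    friends: list = field(default_factory=list)


class TripService:
    """After the seam: collaborators arrive through the constructor."""

    def __init__(self, session, trip_store):
        self._session = session        # stands in for the HTTP-backed session
        self._trip_store = trip_store  # stands in for the DB-backed store

    def trips_of(self, user):
        viewer = self._session.logged_in_user()
        if viewer is None:
            raise PermissionError("no user is logged in")
        if viewer in user.friends:
            return self._trip_store.trips_for(user)
        return []


class FakeSession:
    """Test double: returns a fixed user without touching HTTP."""

    def __init__(self, user):
        self._user = user

    def logged_in_user(self):
        return self._user


class FakeTripStore:
    """Test double: serves trips from a dict without touching a database."""

    def __init__(self, trips_by_name):
        self._trips_by_name = trips_by_name

    def trips_for(self, user):
        return self._trips_by_name.get(user.name, [])
```

With the fakes in place, the first assertion can be written without a network or database. Deciding where that constructor boundary belongs is the human judgment this exercise is meant to surface.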

Prerequisites

Dependency injection · Gilded Rose first

Expert · Full EXACT workflow

EXACT Mini-Project

⏱ 60–90 min · facilitator-required

Start with Example Mapping (story → rules → examples → open questions), then run EXACT with Level B as the default: pause after each Red-Green-Refactor cycle for human review. The exercise synthesises everything: prompt engineering, autonomy control, and test-as-spec discipline. Participants may pick a different autonomy level; comparing Level A and Level C choices in the debrief is the workshop's richest discussion. [12]

Prerequisites

All prior exercises · EXACT framework · Facilitator required

Scaffolding checklist

📦

Dev Container / GitHub Codespaces

One-click identical runtime + AI extension. Eliminates environment setup as a time sink before the first exercise starts. Centric offers a choice of .NET, Spring Boot, or bilingual containers. [5]

🔒

.github/copilot-instructions.md

Encodes TDD rules so the AI tool itself enforces the discipline: "write the failing test first; never implement without a red test; you may not modify the test file." [18]

🔴

Failing tests pre-written

Skeleton repo with tests present, implementation stubs empty. Enforces TDD discipline without requiring participants to write test specs under time pressure. [2]

🌿

Reference solution branch

Unblocks stuck participants without spoiling the exercise for everyone else. Centric names it solution/. Participants reach for it when needed — not on a timer. [5]

📝

prompts/ folder

Participants log each AI prompt chronologically. Makes debrief on prompt quality concrete rather than hypothetical: what context AI needed, where it hallucinated. [4]
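A small helper can keep the log chronological without relying on participants to name files consistently. This is a sketch only: the numbered, timestamped file name and the two-section layout are assumptions, not the convention of the Goose Game repo.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_prompt(prompt: str, outcome: str = "", root: Path = Path("prompts")) -> Path:
    """Write one prompt and its observed outcome to the next numbered file in `root`."""
    root.mkdir(parents=True, exist_ok=True)

    # The sequence number keeps files sorted even when timestamps collide.
    index = len(list(root.glob("*.md"))) + 1
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = root / f"{index:03d}-{stamp}.md"

    path.write_text(
        f"## Prompt\n\n{prompt}\n\n## Outcome\n\n{outcome}\n",
        encoding="utf-8",
    )
    return path
```

The `outcome` field is where participants can note what context the AI needed or where it hallucinated, which gives the debrief concrete material.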

✅

CI on every push

Instant green/red via GitHub Actions. Removes the facilitator as a validation bottleneck — participants self-verify. Caltech's format adds a take-home checklist that outlasts the session. [2] [6]

Frameworks

TDAID

Test-Driven AI Development

1 Plan — AI generates a structured implementation roadmap as a comment block before any code is written

2 Red — write the failing test

3 Green — AI drives minimal implementation

4 Refactor — clean up with AI assistance

5 Validate — human reviews the git diff; confirms tests don't verify broken behavior [3] [1]

Workshop shape: write the Plan as a comment block, let AI drive Red → Green → Refactor, then human-review the diff before moving to the next increment.
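Part of the Validate step can be narrowed mechanically: before reading the diff, list which changed files are tests, since those are the edits most likely to hide a gamed assertion. The sketch below assumes tests live under a `tests/` directory or are named `test_*.py` / `*_test.py`; that convention and the helper names are hypothetical. `git diff --name-only` is the only external command used.

```python
import subprocess
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """Assumed convention: tests/ directory, test_*.py, or *_test.py."""
    p = PurePosixPath(path)
    return (
        "tests" in p.parts
        or p.name.startswith("test_")
        or p.stem.endswith("_test")
    )


def changed_test_files(changed_paths):
    """Filter a list of changed paths down to the ones that look like tests."""
    return [path for path in changed_paths if is_test_file(path)]


def changed_paths_since(base: str = "HEAD"):
    """List files that differ from `base`, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Calling `changed_test_files(changed_paths_since())` inside a repository gives the reviewer a short list to inspect first. It does not replace reading the diff.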

EXACT

Example-guided AI-Collaborative Test-driven Coding

0 Example Mapping — story → rules → examples → open questions. Runs before the first test is written.
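The four layers of an example map translate naturally into a small data structure, which also shows how examples can become test names before the first test is written. This is an illustrative sketch; the class names and the slug-based naming scheme are assumptions, not part of EXACT.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    text: str
    examples: list = field(default_factory=list)


@dataclass
class ExampleMap:
    story: str
    rules: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

    def test_names(self):
        """Derive one test name per example, in the order the map lists them."""
        names = []
        for rule in self.rules:
            for example in rule.examples:
                slug = re.sub(r"[^a-z0-9]+", "_", example.lower()).strip("_")
                names.append(f"test_{slug}")
        return names


string_calculator = ExampleMap(
    story="Sum the numbers in a string",
    rules=[
        Rule("Empty input is zero", ["empty string returns 0"]),
        Rule("Commas separate numbers", ["1,2 returns 3"]),
    ],
    open_questions=["Are negative numbers allowed?"],
)
```

Open questions stay in the map rather than turning into tests, which keeps unresolved decisions visible to the human before the AI starts implementing.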

Level	AI pauses after…	Best for
A	End of feature	Speed · experienced users
B (default)	End of RGR cycle	Workshop default · balanced
C	End of each phase	Learning mode · max oversight
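The table can be read as a single rule: each level names the smallest checkpoint at which the AI must stop for review. The sketch below encodes that reading. It assumes checkpoints nest (finishing a cycle also counts as finishing a phase, and finishing a feature also counts as finishing a cycle); the names `Checkpoint`, `PAUSE_AFTER`, and `should_pause` are hypothetical.

```python
from enum import IntEnum


class Checkpoint(IntEnum):
    PHASE = 1    # a single Red, Green, or Refactor phase has finished
    CYCLE = 2    # a full Red-Green-Refactor cycle has finished
    FEATURE = 3  # the whole feature has finished


# Smallest checkpoint at which each autonomy level hands control back.
PAUSE_AFTER = {
    "A": Checkpoint.FEATURE,
    "B": Checkpoint.CYCLE,
    "C": Checkpoint.PHASE,
}


def should_pause(level: str, reached: Checkpoint) -> bool:
    """True when the AI should stop and wait for human review."""
    return reached >= PAUSE_AFTER[level]
```

For example, `should_pause("B", Checkpoint.PHASE)` is false while `should_pause("C", Checkpoint.PHASE)` is true, which is the difference in oversight participants compare in the debrief.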

Leave autonomy level open during exercises — participant choice becomes the richest debrief material. The GitHub Copilot Workshop maps a similar three-path progression onto beginner-to-expert cohorts. [12] [15]

Expert audience notes

Skip TDD theory. They know red-green-refactor. Spend that time on what changes with AI in the loop: the Validate phase, autonomy levels, and prompt engineering.

Debrief the prompts, not the code. The prompts/ log makes this concrete. The most valuable expert discussion is about prompt quality — not implementation choices. [4]

Pair strategically. Architects own test strategy and system design; devs drive agent prompts and the RGR cycle. Knowledge transfer surfaces naturally without making it the explicit goal. [19]

Use real-world complexity. Greenfield toys disengage senior developers. Working on existing codebases is more relevant and more engaging for the cohort. [7]

Demo the failure mode. Implement a feature without tests via AI, add a second feature, observe architectural degradation live. AI never spontaneously suggests refactoring without test constraints. Show the diff. [20]

Plan before coding. A SPEC.md or mini-PRD before prompting shifts AI from free-wheeling generator to constrained implementation engine — pairs naturally with EXACT's Example Mapping step. [13]