The backbone evolution of reverse diffusion — from convolutional U-Net through DiT to hybrid rectified-flow transformers — annotated with metrics, code, and 52 citations.
ε_θ(xₜ, t)
1/√2 for training stability
AdaGN(h, y) = y_s · GroupNorm(h) + y_b: sinusoidal timestep embedding + class label injected into every residual block
{320, 640, 1280, 1280}, 2-3 ResNet blocks per stage + 8-head self-attn + 8-head cross-attn to CLIP-ViT/L (77-token limit)
p×p patches → T = (32/p)² tokens → standard ViT transformer stack
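The patchify arithmetic above, T = (32/p)², can be checked with a few lines of Python. This is only a sketch: the helper name `num_tokens` is hypothetical, and the 32×32 latent grid is taken from the formula itself.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens after cutting a square latent into p x p patches."""
    if latent_size % patch_size != 0:
        raise ValueError("patch size must divide the latent size")
    per_side = latent_size // patch_size  # patches along one side
    return per_side * per_side            # T = (latent_size / p)^2


if __name__ == "__main__":
    # Smaller patches mean more tokens, so attention cost rises quickly.
    for p in (2, 4, 8):
        print(f"p={p}: T={num_tokens(32, p)}")
```

Halving the patch size quadruples the token count, which is why patch size is the main compute knob in a patch-based transformer backbone.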
Hugging Face Diffusers exposes UNet2DConditionModel for SD 1.x/SDXL pipelines and Transformer2DModel for DiT-based pipelines behind a unified DiffusionPipeline interface. This abstraction made architecture experimentation accessible without hand-wiring every backbone variant. [17]

| Parametrisation | Network outputs | Score identity | Training notes | Src |
|---|---|---|---|---|
| ε-prediction | Added noise ε | s_θ = −ε_θ / √(1−ᾱₜ) | DDPM default; simplified loss E‖ε − ε_θ‖²; constant timestep weight on the ε loss ≡ SNR weighting on the x₀ loss | [7] [36] |
| x₀-prediction | Clean signal x₀ | s_θ = (αₜ·x₀_θ − xₜ) / σₜ² | Better behaved at high noise; loss weighting proportional to SNR | [37] |
| v-prediction | Velocity v = αₜ·ε − σₜ·x₀ | Interpolates ε and x₀ representations | ~Constant-variance target across all t; standard for distillation; Min-SNR-γ weight = min(SNR, γ) / (SNR+1) | [30] [34] |
| Rectified flow | Drift v = X₁ − X₀ | Straight ODE paths; equals v-pred up to weighting | Xₜ = t·X₁ + (1−t)·X₀; straighter paths → fewer sampling steps needed; used in SD3, FLUX.1 | [35] |
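The ε, x₀, and v parametrisations describe the same prediction in different coordinates, so one can be converted into the others. The sketch below assumes the variance-preserving forward process xₜ = αₜ·x₀ + σₜ·ε with αₜ = √ᾱₜ and σₜ = √(1−ᾱₜ); the function name `predictions_from_eps` is hypothetical and scalars stand in for tensors.

```python
import math


def predictions_from_eps(x_t: float, eps: float, alpha_bar: float):
    """Recover the x0-prediction, v-prediction, and score from a noise prediction.

    Assumes x_t = alpha * x0 + sigma * eps with alpha^2 + sigma^2 = 1.
    """
    alpha = math.sqrt(alpha_bar)        # signal coefficient
    sigma = math.sqrt(1.0 - alpha_bar)  # noise coefficient
    x0 = (x_t - sigma * eps) / alpha    # invert the forward process
    v = alpha * eps - sigma * x0        # velocity target
    score = -eps / sigma                # score from the noise prediction
    return x0, v, score


if __name__ == "__main__":
    alpha_bar = 0.64
    alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = alpha * 0.8 + sigma * (-0.3)  # noisy sample built from known x0, eps
    x0, v, score = predictions_from_eps(x_t, -0.3, alpha_bar)
    # The score computed from x0 agrees with the one computed from eps.
    assert math.isclose(score, (alpha * x0 - x_t) / sigma**2)
    print(x0, v, score)
```

The final assertion checks that the score obtained from the noise prediction matches (αₜ·x₀ − xₜ)/σₜ², the same quantity written in terms of the clean-signal prediction.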
EDM (Karras et al.) unifies all four: D_θ = c_skip·x + c_out·F_θ(c_in·x; c_noise), selecting coefficients for unit-variance targets at every σ level. [31]
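A minimal sketch of the EDM preconditioning follows. The coefficient formulas and σ_data = 0.5 are the defaults I recall from Karras et al. and should be checked against [31]; `raw_network` is a hypothetical stand-in for F_θ.

```python
import math

SIGMA_DATA = 0.5  # assumed data standard deviation (EDM default)


def edm_coefficients(sigma: float, sigma_data: float = SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for noise level sigma."""
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total                 # how much of x passes through
    c_out = sigma * sigma_data / math.sqrt(total)  # scales the network output
    c_in = 1.0 / math.sqrt(total)                  # normalises the network input
    c_noise = math.log(sigma) / 4.0                # noise-level conditioning
    return c_skip, c_out, c_in, c_noise


def denoise(x: float, sigma: float, raw_network) -> float:
    """D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x; c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma)
    return c_skip * x + c_out * raw_network(c_in * x, c_noise)


if __name__ == "__main__":
    # With a network that outputs zero, only the skip path contributes.
    print(denoise(2.0, 0.5, lambda x_in, c_noise: 0.0))
```

At low σ the skip path dominates and the network only predicts a small correction; at high σ the skip path vanishes and the network predicts the clean signal almost directly.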
Cross-attention: text-encoder embeddings become keys/values; latent spatial features become queries. Introduced by Latent Diffusion, used by SD 1.x and PixArt-α. Imagen showed that a frozen T5-XXL outperforms CLIP on compositional prompts. [12][40]
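The query/key/value roles in cross-attention can be sketched in plain Python. This single-head version omits the learned projection matrices and multi-head split that real models use; the inputs are assumed to be already projected, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per latent spatial position.
    keys, values: one vector per text token.
    Returns one text-conditioned vector per latent position.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # attention over text tokens
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


if __name__ == "__main__":
    latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 spatial positions
    text_k = [[1.0, 0.0], [0.0, 1.0]]               # 2 text tokens
    text_v = [[0.0, 2.0], [4.0, 6.0]]
    print(cross_attention(latents, text_k, text_v))
```

The output has one row per latent position, not per text token, so the spatial layout of the feature map is preserved while each position reads from the prompt.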
adaLN-Zero: scale/shift parameters are regressed from the summed timestep + class/text embedding, and the residual modulation is zero-initialised. DiT found this beats cross-attention and in-context conditioning at matched compute. Dominant in all MMDiT-lineage models. [20]
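A sketch of adaptive layer-norm modulation with a zero-initialised residual gate, on a single feature vector. The (1 + scale) form and the separate gate follow the DiT reference implementation as I understand it; the MLP that regresses scale, shift, and gate from the conditioning embedding is omitted, and all names are hypothetical.

```python
import math


def layer_norm(h, eps=1e-6):
    """LayerNorm without learned affine parameters."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def adaln_zero_block(h, scale, shift, gate, sublayer):
    """h + gate * sublayer(LayerNorm(h) * (1 + scale) + shift).

    scale, shift, and gate would come from the conditioning embedding;
    zero-initialising that regression makes the block start as the identity.
    """
    modulated = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(h), scale, shift)]
    update = sublayer(modulated)  # attention or MLP in a real block
    return [x + g * u for x, g, u in zip(h, gate, update)]


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0, 4.0]
    zeros = [0.0] * len(h)
    # At initialisation every modulation term is zero, so the block is a no-op.
    print(adaln_zero_block(h, zeros, zeros, zeros, lambda z: z))
```

Starting each block as the identity lets a deep stack train stably, because the conditioning signal is introduced gradually as the gate moves away from zero.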
Classifier-free guidance: jointly train a conditional and an unconditional model, then combine their noise estimates at inference: ε̂ = ε(x,∅) + w·(ε(x,text) − ε(x,∅)). The guidance_scale knob is w: it scales how strongly the conditional direction pulls the sample toward the prompt. [38]
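The guidance combination ε̂ = ε(x,∅) + w·(ε(x,text) − ε(x,∅)) is a one-line extrapolation. The sketch below applies it element-wise to plain lists standing in for noise-prediction tensors; the function name is hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    eps_uncond = [0.0, 1.0]
    eps_cond = [1.0, 3.0]
    for w in (0.0, 1.0, 7.5):
        print(w, classifier_free_guidance(eps_uncond, eps_cond, w))
```

With w = 0 the prompt is ignored, with w = 1 the result is the plain conditional prediction, and w > 1 pushes past it, which typically trades sample diversity for prompt adherence.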
| System | Backbone | Target | Text encoders | Src |
|---|---|---|---|---|
| SD 1.x | U-Net + cross-attn | ε-prediction | CLIP-ViT/L (77 tok) | [14] |
| SD3.5 | MMDiT + adaLN + QK-norm | Rectified flow | OpenCLIP-ViT/G + CLIP-ViT/L + T5-XXL | [43] [44] |
| FLUX.1 (12B) | Hybrid MMDiT + parallel DiT | Rectified flow | T5-XXL + CLIP | [23] [24] |
| Ideogram 4.0 (9.3B) | Single-stream DiT, 34 layers | — | Qwen3-VL-8B-Instruct | [49] |
| Sora (OpenAI) | DiT over spacetime patches | — | Undisclosed | [25] |
| Midjourney | Closed / undisclosed | Undisclosed | Undisclosed | [48] |