The deepest unifying thread across this expedition is a single mathematical object: the score function, the gradient of the log data density, ∇ₓ log p(x). Forward diffusion destroys information by following a noise schedule; reverse diffusion reconstructs it by following the score back toward the data. Every network architecture surveyed (U-Net, DiT, MMDiT, FLUX) is a different box for computing this gradient. Every training objective (ε-prediction, x₀-prediction, v-prediction, rectified flow) is a reparametrization of the same score, [1] and Karras et al.’s EDM framework makes the equivalence explicit via preconditioning coefficients. [2] When you increase a diffusers guidance_scale (or, presumably, a closed-source control such as Midjourney’s --stylize, whose implementation is not public), you are literally amplifying the conditional score relative to the unconditional one: classifier-free guidance (CFG) computes ε̂ = ε(x,∅) + w·(ε(x,text) − ε(x,∅)), steering the sampling trajectory in latent space.
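The CFG formula above can be sketched in a few lines. This is a minimal illustration, not any library's implementation: the function name cfg_combine and the random arrays standing in for the two noise predictions are assumptions made for the example.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Stand-ins for the network's two noise predictions on a 4x64x64 latent.
# In a real pipeline these come from two forward passes of the same model.
rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))
eps_cond = rng.standard_normal((4, 64, 64))

gap = np.linalg.norm(eps_cond - eps_uncond)
for w in (0.0, 1.0, 7.5):
    guided = cfg_combine(eps_uncond, eps_cond, w)
    # Distance from the unconditional prediction grows linearly with w.
    ratio = np.linalg.norm(guided - eps_uncond) / gap
    print(f"w={w}: distance from unconditional = {ratio:.2f} x gap")
```

With w = 0 the prompt is ignored, w = 1 reproduces the plain conditional prediction, and w > 1 extrapolates past it along the same direction, which is the "amplification" described above.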
Architecture and anatomy are causally linked. The hands post-mortem and the architecture survey are really the same story told from different angles. U-Net’s convolutional locality meant fingers were processed without global palm context: the network learned fingers as local textures, not as a kinematically constrained structure. [3] DiT and MMDiT replace this with full-sequence self-attention, in which every finger token attends to every palm token in every layer. [4] FLUX.1’s 12B-parameter MMDiT achieved “correct finger count in the vast majority of generations” not from anatomical training but from the architectural side-effect that global attention encourages global consistency. [5] The scaling trend DiT demonstrated (more compute yields lower FID) is plausibly the same reason FLUX outperforms SDXL on anatomy at matched prompt complexity.
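The locality argument can be made concrete with a toy contrast. The sketch below is an illustration under stated assumptions, not a model of any real U-Net or DiT: it uses a 1D sequence of 16 hypothetical tokens, a single 3-tap convolution, and a single unparameterized self-attention layer. Real networks stack many layers and mix both operations, so this isolates only the reach of one layer of each kind.

```python
import numpy as np

def self_attention(tokens):
    """Single-head self-attention with identity projections (toy version)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

def local_conv(tokens):
    """3-tap convolution along the token axis with zero padding."""
    padded = np.pad(tokens, ((1, 1), (0, 0)))
    return 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]

def rows_changed(before, after, tol=1e-9):
    """Indices of output tokens that differ between two runs."""
    return np.flatnonzero(np.abs(after - before).max(axis=1) > tol).tolist()

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens, 8 channels (hypothetical sizes)
perturbed = tokens.copy()
perturbed[0] += 1.0                    # change one "palm" token

conv_rows = rows_changed(local_conv(tokens), local_conv(perturbed))
attn_rows = rows_changed(self_attention(tokens), self_attention(perturbed))
print("conv outputs affected:", conv_rows)
print("attention outputs affected:", attn_rows)
```

After one convolution, only the perturbed token and its immediate neighbour see the change; after one attention layer, every token does.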
The text encoder stack is the hidden variable behind prompt strategy. CLIP’s 77-token hard limit and visual-contrastive training bias made keyword front-loading rational in SD 1.x: token 78 onward was invisible, so packing the most important terms early was load-bearing engineering, not stylistic preference. [6] Imagen’s demonstration that a frozen T5-XXL outperforms CLIP on compositional prompts [7], together with SD3’s stacking of two CLIP encoders with T5 and FLUX’s pairing of CLIP with T5, is what legitimises Midjourney v6’s “write a natural scene description” instruction. Once T5 is in the stack, the model has a language model’s grasp of grammar, syntax, and spatial relations available to it; the keyword-soup habit actively fights against this by fragmenting coherent grammatical structure into a bag of tokens. By 2026 HiDream-I1 uses a full Llama-3.1-8B encoder [8], and at that point prompt writing and LLM prompting largely converge.
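The truncation effect is easy to see in a toy form. The sketch below is a hypothetical stand-in, not CLIP's tokenizer: it splits on whitespace instead of using byte-pair encoding, and the function name and marker strings are invented for the example. It does assume that the 77-token context includes the start and end markers, leaving 75 slots for the prompt itself.

```python
MAX_LEN = 77                     # context length, counting start and end markers
BOS, EOS = "<bos>", "<eos>"      # placeholder marker strings for this toy

def toy_truncate(prompt, max_len=MAX_LEN):
    """Whitespace 'tokenizer' that keeps only what fits in the context window."""
    words = prompt.split()
    kept = words[: max_len - 2]
    dropped = words[max_len - 2:]
    return [BOS] + kept + [EOS], dropped

filler = ["detail"] * 80
subject = ["hand", "five", "fingers"]

front_tokens, front_dropped = toy_truncate(" ".join(subject + filler))
back_tokens, back_dropped = toy_truncate(" ".join(filler + subject))

print(len(front_tokens), "hand" in front_tokens)  # subject survives
print(len(back_tokens), "hand" in back_tokens)    # subject is cut off
```

Real BPE tokenization usually produces more tokens than words, so the usable budget in words is smaller than this toy suggests.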
Resolution and the VAE interact in ways the surface-level pipeline hides. The VAE’s 8× spatial compression means a full-frame hand at 512px collapses to 5–8 latent pixels per finger, so the decoder is asked to reconstruct detail it never saw encoded. [9] SDXL’s move to 1024px native resolution was not cosmetic; it doubled the latent grid along each axis (64×64 to 128×128), quadrupling the number of latent positions available for fine anatomy. ControlNet’s depth-map injection acts as a third bypass: it supplies the spatial constraints the latent bottleneck loses, explicitly conditioning the score network on 3D structure rather than asking it to infer depth from pixel patterns alone. The open question is whether higher-capacity VAEs (SD3 uses a 16-channel VAE vs SD 1.5’s 4-channel, at the same 8× spatial compression) make ControlNet redundant for anatomy, or whether explicit structural conditioning remains complementary regardless of VAE capacity.
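The latent-grid arithmetic behind this paragraph is short enough to write out. In the sketch below, the helper names and the 10% hand width are hypothetical values chosen for illustration; only the 8× compression factor and the 512px and 1024px resolutions come from the text.

```python
def latent_side(image_px, vae_factor=8):
    """Side length of the latent grid for a square image."""
    return image_px // vae_factor

def latent_extent(image_px, feature_fraction, vae_factor=8):
    """Latent cells spanned by a feature covering a fraction of the image width."""
    return image_px * feature_fraction / vae_factor

HAND_FRACTION = 0.10  # hypothetical: a hand spanning 10% of the frame width

for image_px in (512, 1024):
    side = latent_side(image_px)
    span = latent_extent(image_px, HAND_FRACTION)
    print(f"{image_px}px -> {side}x{side} latents "
          f"({side * side} cells), hand spans {span:.1f} cells")

# Doubling the pixel resolution doubles each latent axis and
# quadruples the number of latent cells.
area_gain = latent_side(1024) ** 2 / latent_side(512) ** 2
print("latent cell gain:", area_gain)
```

Channel count does not enter this calculation: a 16-channel VAE stores more information per cell but does not change how many cells a finger covers.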
The noise schedule’s effect on practitioners is underappreciated. Every sampler choice (DDIM vs DPM-Solver vs Euler, step count, sigma schedule) changes which part of the SNR curve gets the most compute. The finding that zero terminal SNR should be enforced [10] is directly observable in generation: models trained with a schedule whose final timestep still carries some signal, such as Stable Diffusion’s default scaled-linear schedule, produce slightly hazy darks and cannot generate pure-black backgrounds. The practitioner proxy for this is noticing when a model refuses to produce high-contrast results, which is the symptom of non-zero terminal SNR leaking into inference.
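Terminal SNR can be checked directly from a schedule's betas. The sketch below assumes a scaled-linear schedule with 1000 steps and a beta range of 0.00085 to 0.012, the values commonly published for Stable Diffusion 1.x; the function names are invented for the example. The rescaling shown is one commonly described remedy (shift and rescale sqrt(alpha_bar) so the last step carries no signal), offered as an illustration and not as the method of any particular cited work.

```python
import numpy as np

def scaled_linear_alpha_bar(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative signal fraction alpha_bar_t for a scaled-linear beta schedule."""
    betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps) ** 2
    return np.cumprod(1.0 - betas)

def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar) at each timestep."""
    return alpha_bar / (1.0 - alpha_bar)

def rescale_to_zero_terminal_snr(alpha_bar):
    """Shift and rescale sqrt(alpha_bar) so the last step is exactly zero
    while the first step keeps its original value."""
    root = np.sqrt(alpha_bar)
    first, last = root[0], root[-1]
    root = (root - last) * first / (first - last)
    return root ** 2

alpha_bar = scaled_linear_alpha_bar()
fixed = rescale_to_zero_terminal_snr(alpha_bar)

terminal_before = snr(alpha_bar)[-1]
terminal_after = snr(fixed)[-1]
print(f"terminal SNR before: {terminal_before:.6f}, after: {terminal_after:.6f}")
```

A small positive terminal SNR means the model never trains on pure noise, so at inference, where sampling starts from pure noise, it has learned to expect a faint trace of the image's mean brightness that is not there.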
What no single child fully resolves: how the shift to rectified flow (FLUX, SD3) changes the prompt-engineering intuitions built on DDPM-style score matching. Rectified flow learns straight transport paths rather than curved SDE trajectories [11] — empirically this makes fewer steps viable, but whether it changes which prompt tokens govern which spatial regions in cross-attention, and therefore whether front-loading rules still apply, is not yet settled in the literature.
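Why straight paths make fewer steps viable can be shown with a toy integrator. The sketch below is an idealized illustration, not a trained model: it uses a single data/noise pair with oracle velocities, the convention x_t = (1 − t)·data + t·noise, and a hypothetical curved path for contrast. A learned velocity field averages over many pairings and is only approximately straight, so real models still need more than one step.

```python
import numpy as np

def euler_sample(x_noise, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = x_noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x_data = rng.standard_normal(8)
x_noise = rng.standard_normal(8)

def straight_velocity(x, t):
    # Path x_t = (1 - t) * data + t * noise has constant velocity noise - data.
    return x_noise - x_data

def curved_velocity(x, t):
    # Hypothetical curved path x_t = sin(pi t / 2) * noise + cos(pi t / 2) * data.
    half_pi = np.pi / 2
    return half_pi * (np.cos(half_pi * t) * x_noise - np.sin(half_pi * t) * x_data)

def error(velocity_fn, num_steps):
    return np.linalg.norm(euler_sample(x_noise, velocity_fn, num_steps) - x_data)

for steps in (1, 4, 100):
    print(f"{steps:>3} steps: straight error = {error(straight_velocity, steps):.2e}, "
          f"curved error = {error(curved_velocity, steps):.2e}")
```

On the straight path a single Euler step lands on the data point up to floating-point error, while the curved path needs many steps to reach comparable accuracy. Nothing in this sketch touches cross-attention, so it says nothing about the open question of whether front-loading rules carry over.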