TL;DR: The network that runs reverse denoising isn’t predicting “the image”; it’s estimating the score, the gradient of the log data density ∇ log p(x), which points back toward clean data at every noise level [1]. Predicting noise (ε), the clean signal (x₀), velocity (v), or a flow-matching drift are all the same target reparametrized: for Gaussian noise the score is just −ε/σ [5]. The backbone that computes it has migrated from the convolutional U-Net (Stable Diffusion, ADM) to the Diffusion Transformer (DiT → MMDiT), which now powers FLUX.1 ⭐ 25.6k, SD3.5, and Sora because attention scales more smoothly than convolution [18][45].
The one idea: the denoiser is a score network
Reverse denoising looks like “remove a little noise, repeat.” Mathematically it’s gradient ascent on probability. A score-based model is a network s_θ(x) trained to approximate the score function — the gradient of the log-density ∇ₓ log p(x) — which sidesteps the intractable normalizing constant that plagues density models [1]. Once you know the score at every noise level, you generate samples by Langevin dynamics: repeatedly step along the score and inject a little Gaussian noise [1]. Song & Ermon’s NCSN did exactly this — estimate scores across many noise scales, then sample with annealed Langevin dynamics — and it is the direct precursor to modern diffusion [6].
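A minimal sketch of the Langevin update described above, assuming a toy 1-D Gaussian target whose score is known in closed form. The function `gaussian_score`, the step size, and the step count are illustrative choices rather than values from the cited papers, and this is plain (not annealed) Langevin dynamics.

```python
import numpy as np

def gaussian_score(x, mean=2.0, std=0.5):
    """Analytic score d/dx log N(x; mean, std^2); stands in for a trained s_theta."""
    return -(x - mean) / std**2

def langevin_sample(score_fn, n_samples=5000, n_steps=500, step_size=1e-2, seed=0):
    """Unadjusted Langevin dynamics: step along the score, then inject Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_samples)  # arbitrary starting points
    for _ in range(n_steps):
        noise = rng.normal(size=n_samples)
        # x <- x + (step/2) * score(x) + sqrt(step) * z
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * noise
    return x

samples = langevin_sample(gaussian_score)
print(samples.mean(), samples.std())  # should land near the target mean and std
```

The sampler never evaluates the density itself, only its gradient, which is why the normalizing constant drops out.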
Three classical results glue “denoising” to “score”:
- Denoising score matching (Vincent, 2011): training a denoising autoencoder to reconstruct clean data from a Gaussian-corrupted input is mathematically equivalent to matching the score of the noised density — and it avoids the second derivatives plain score matching needs [3].
- Tweedie / Miyasawa: for Y = X + σε, the posterior mean is E[X|Y] = Y + σ²∇log p_σ(Y), equivalently E[ε|Y] = −σ∇log p_σ(Y), so minimizing the denoising MSE recovers the score directly [4].
- The ε↔score identity: for Gaussian perturbations the conditional score is ∇log p_σ(Y|X) = −ε/σ, which makes DDPM’s noise-prediction network a reparametrized score model, s_θ(xₜ,t) = −ε_θ(xₜ,t)/√(1−ᾱₜ) [5][36].
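The Tweedie identity can be checked numerically in a case where everything is available in closed form. The sketch below assumes Gaussian data X ~ N(0, s_data²), so the noised density and the exact posterior mean are both known; `s_data` and `sigma` are arbitrary toy values.

```python
import numpy as np

s_data, sigma = 1.5, 0.8        # assumed toy values: X ~ N(0, s_data^2), Y = X + sigma * eps
var_y = s_data**2 + sigma**2    # Y ~ N(0, s_data^2 + sigma^2)

def score_noised(y):
    """Score of the noised density p_sigma(y) = N(0, var_y)."""
    return -y / var_y

def posterior_mean_x(y):
    """Exact E[X | Y = y] for jointly Gaussian (X, Y)."""
    return y * s_data**2 / var_y

y = np.linspace(-3.0, 3.0, 7)
tweedie_x = y + sigma**2 * score_noised(y)       # Tweedie estimate of E[X | Y]
eps_mean = (y - posterior_mean_x(y)) / sigma     # E[eps | Y], from Y = X + sigma * eps

print(np.allclose(tweedie_x, posterior_mean_x(y)))
print(np.allclose(eps_mean, -sigma * score_noised(y)))
```

Both checks agree, so in this toy case the optimal denoiser and the score carry the same information, and the expected noise is the score scaled by −σ.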
Ho et al.’s DDPM uses the ε-prediction objective, and its simplified loss coincides with multi-scale denoising score matching [7]. Song et al.’s score-SDE framework unifies everything: a forward noising SDE has a closed-form reverse-time SDE, dx = [f − g²∇log pₜ] dt + g dw, driven solely by the score, plus a deterministic probability-flow ODE, dx = [f − ½g²∇log pₜ] dt, that shares the same marginals and enables exact likelihoods [2]. Everything below is how to build the box that outputs s_θ.
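As a hedged illustration of the probability-flow ODE, the sketch below integrates dx = [f − ½g²∇log pₜ] dt backward in time with Euler steps. It assumes a variance-exploding setup with f = 0 and σ(t)² = t (so g² = 1) and Gaussian data, which makes the exact score and the exact ODE solution available for comparison. The names `s0_sq`, `T`, and `pf_ode_reverse` are hypothetical.

```python
import numpy as np

s0_sq = 0.25   # assumed data variance: x_0 ~ N(0, s0_sq)
T = 4.0        # assumed time horizon; sigma(t)^2 = t, so f = 0 and g(t)^2 = 1

def score(x, t):
    """Exact score of p_t = N(0, s0_sq + t); stands in for s_theta(x, t)."""
    return -x / (s0_sq + t)

def pf_ode_reverse(x_T, n_steps=2000):
    """Euler-integrate dx = [f - 0.5 * g^2 * score] dt from t = T down to t = 0."""
    dt = -T / n_steps
    x, t = np.asarray(x_T, dtype=float), T
    for _ in range(n_steps):
        drift = 0.0 - 0.5 * 1.0 * score(x, t)   # f = 0, g^2 = 1
        x = x + drift * dt
        t = t + dt
    return x

x_T = np.array([-2.0, 0.5, 3.0])
x_0 = pf_ode_reverse(x_T)
exact = x_T * np.sqrt(s0_sq / (s0_sq + T))   # closed-form solution for this toy case
print(x_0, exact)
```

The map is deterministic: each noisy point is shrunk toward the data distribution without any injected noise, which is what makes the ODE usable for likelihood computation.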
Backbone 1 — the U-Net denoiser
The denoiser inherits its skeleton from the U-Net of Ronneberger et al. (2015): a fully convolutional encoder–decoder with a contracting path (two 3×3 convs + ReLU, then 2× max-pool), a symmetric expanding path of up-convolutions, and skip connections that concatenate encoder features into the matching decoder stage to preserve spatial detail [8][9]. Because it outputs a tensor the same size as its input, it is a natural fit for predicting per-pixel noise — which is exactly why diffusers ⭐ 33.8k packages it as the default UNet2DConditionModel [17].
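To make the shape bookkeeping concrete, here is a weight-free sketch of one contracting stage, one expanding stage, and the skip concatenation between them. It is not a trainable U-Net: nearest-neighbour upsampling stands in for the learned up-convolution, and a channel average stands in for the convolution that would map the concatenated features back to the input width.

```python
import numpy as np

def max_pool_2x(x):
    """2x2 max-pool on a (C, H, W) array; H and W are assumed even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample_2x(x):
    """Nearest-neighbour 2x upsampling, a stand-in for a learned up-convolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def tiny_unet_shapes(x):
    """Trace shapes through one encoder stage, a bottleneck, and one decoder stage."""
    c, h, w = x.shape
    skip = x                                       # encoder features kept for the skip
    bottleneck = max_pool_2x(x)                    # contracting path halves H and W
    up = upsample_2x(bottleneck)                   # expanding path restores H and W
    merged = np.concatenate([skip, up], axis=0)    # skip connection: channel concat
    out = merged.reshape(2, c, h, w).mean(axis=0)  # stand-in for a 2C -> C convolution
    return merged, out

x = np.random.default_rng(0).normal(size=(4, 32, 32))
merged, out = tiny_unet_shapes(x)
print(merged.shape, out.shape)
```

The output has the same shape as the input, which is the property that lets the network emit one noise estimate per input element.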
Dhariwal & Nichol’s ADM (“Diffusion Models Beat GANs,” 2021) ablated this skeleton into the dominant form and beat GANs on ImageNet (FID 2.97 / 4.59 / 7.72 at 128/256/512) [10]. The diffusion-specific changes:
| Modification | What it does | Source |
|---|---|---|
| More depth over width | better FID at fixed compute | [11] |
| Multi-head attention (64 ch/head) at 32², 16², 8² | global coherence at low resolutions | [11] |
| BigGAN residual blocks for up/downsampling | stable feature rescaling | [11] |
| Residual rescale by 1/√2 | training stability | [11] |
| Adaptive Group Norm (AdaGN) | injects timestep + class embedding into each res block | [11] |
The timestep enters as a sinusoidal positional embedding (or random Fourier features), MLP-projected and added into residual blocks or fed through AdaGN [13]. Rombach et al.’s Latent Diffusion (CVPR 2022) then made two moves that defined Stable Diffusion: run the U-Net in a compressed VAE latent space, and add cross-attention layers so the denoiser becomes a text/conditional generator [12]. The Stable Diffusion U-Net has four stages — channels {320, 640, 1280, 1280} — each with 2–3 ResNet blocks plus 8-head self-attention and 8-head cross-attention to CLIP embeddings [14]. Its open-source footprint cemented the design: CompVis/stable-diffusion ⭐ 73k (Jun 2026) [15] and the original latent-diffusion ⭐ 14k [16].
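A minimal sketch of a sinusoidal timestep embedding, assuming the common transformer-style frequency schedule. The width of 320 simply mirrors the first-stage channel count quoted above, and `max_period` is an assumed default. In a real denoiser this vector would then pass through an MLP before being added to the residual blocks.

```python
import numpy as np

def timestep_embedding(t, dim=320, max_period=10000.0):
    """Map timesteps t to a (len(t), dim) array of cos/sin features; dim assumed even."""
    t = np.asarray(t, dtype=np.float64).reshape(-1)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=1)

emb = timestep_embedding([0, 10, 999])
print(emb.shape)
```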
Backbone 2 — Diffusion Transformers (DiT → MMDiT)
Peebles & Xie’s DiT discards the U-Net entirely: a plain transformer operating on patchified latents. A VAE downsamples a 256×256 image 8× to a 32×32×4 latent, which is cut into p×p patches yielding T = (I/p)² tokens, where I is the latent side length (32 here); halving the patch size quadruples the token count and at least quadruples Gflops [19]. Conditioning uses adaLN-zero: scale/shift parameters are regressed from the summed timestep+class embedding, and the per-block residual modulation is zero-initialized so each block starts as the identity function [18]. This beats both cross-attention and in-context conditioning while being the most compute-efficient [20]. The headline result is a clean scaling law: higher-Gflops DiTs consistently reach lower FID, with DiT-XL/2 hitting FID-50K 2.27 on ImageNet 256² [19]. Repo: facebookresearch/DiT ⭐ 8.6k [28].
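The token-count formula and the identity-at-initialization property of adaLN-zero can both be seen in a few lines. This is a simplified sketch, not the DiT implementation: it uses a single linear regression layer without the activation, and `w_inner` is a hypothetical stand-in for the attention or MLP sub-layer.

```python
import numpy as np

def num_tokens(latent_size=32, patch=2):
    """T = (I / p)^2 tokens for an I x I latent cut into p x p patches."""
    return (latent_size // patch) ** 2

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNZeroBlock:
    """One modulated residual branch: x + gate * f(LN(x) * (1 + scale) + shift)."""

    def __init__(self, dim, rng):
        # Regression layer cond -> (shift, scale, gate), zero-initialized.
        self.w_mod = np.zeros((dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Hypothetical stand-in for the attention / MLP sub-layer.
        self.w_inner = rng.normal(scale=0.02, size=(dim, dim))

    def __call__(self, x, cond):
        shift, scale, gate = np.split(cond @ self.w_mod + self.b_mod, 3, axis=-1)
        h = layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]
        return x + gate[:, None, :] * (h @ self.w_inner)

rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, rng=rng)
x = rng.normal(size=(2, num_tokens(32, 2), 8))   # (batch, tokens, dim)
cond = rng.normal(size=(2, 8))                   # summed timestep + class embedding
out = block(x, cond)
print([num_tokens(32, p) for p in (8, 4, 2)], np.allclose(out, x))
```

Because the gate starts at zero, the residual branch contributes nothing at initialization and the block passes its input through unchanged, whatever the conditioning vector is.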
The template propagated fast:
- SD3 → MMDiT (Esser et al., 2024): separate weights per modality (image vs text) but a joined sequence for shared attention, conditioned on two CLIP encoders plus T5, trained under reweighted Rectified Flow; scales 450M → 8B params without saturating [21][22].
- FLUX.1 (Black Forest Labs ⭐ 25.6k, Aug 2024): a 12B rectified-flow transformer hybridizing MMDiT “double-stream” blocks (text+image tokens jointly attended) with “single-stream” parallel DiT blocks, conditioned on T5-XXL + CLIP [23][24][29].
- Sora: a DiT over spacetime patches of video latents as tokens, enabling variable resolution and duration [25].
- PixArt-α keeps cross-attention to inject text into DiT blocks, reaching near-SOTA at ~1% of RAPHAEL’s training cost [26]; Hunyuan-DiT is a multi-resolution DiT with bilingual Chinese/English understanding [27].
- SiT (Scalable Interpolant Transformers, ECCV 2024) keeps the DiT backbone wholesale but swaps the diffusion formulation for a flexible stochastic-interpolant/flow framework, then decouples four design axes (discrete vs. continuous time, prediction target, interpolant choice, and a deterministic vs. stochastic sampler) from the fixed network [51]. Holding architecture and compute constant, SiT beats DiT at every model size, with SiT-XL/2 reaching FID-50K 2.06 at 256², because the sampler’s diffusion coefficient can be tuned separately from learning [51][52]. Repo: willisma/SiT ⭐ 1.2k.
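The MMDiT idea of separate weights per modality but a joined sequence for attention can be sketched with a single attention head. This is an illustrative reduction under stated assumptions, not the SD3 or FLUX.1 code: it omits multiple heads, output projections, adaLN modulation, and the MLP, and the weight matrices `w_img` and `w_txt` are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(img_tokens, txt_tokens, w_img, w_txt):
    """Per-modality Q/K/V projections, then one attention over the joined sequence."""
    d = img_tokens.shape[-1]
    q_i, k_i, v_i = np.split(img_tokens @ w_img, 3, axis=-1)   # image-stream weights
    q_t, k_t, v_t = np.split(txt_tokens @ w_txt, 3, axis=-1)   # text-stream weights
    q = np.concatenate([q_t, q_i], axis=0)
    k = np.concatenate([k_t, k_i], axis=0)
    v = np.concatenate([v_t, v_i], axis=0)
    attn = softmax(q @ k.T / np.sqrt(d))     # every token attends to both modalities
    out = attn @ v
    n_txt = txt_tokens.shape[0]
    return out[n_txt:], out[:n_txt]          # split back into image and text streams

rng = np.random.default_rng(0)
d = 16
img = rng.normal(size=(64, d))
txt = rng.normal(size=(8, d))
w_img = rng.normal(scale=d**-0.5, size=(d, 3 * d))
w_txt = rng.normal(scale=d**-0.5, size=(d, 3 * d))
img_out, txt_out = joint_attention(img, txt, w_img, w_txt)
print(img_out.shape, txt_out.shape)
```

Changing the text tokens changes the image-stream output, which is the sense in which the prompt steers the denoiser without a dedicated cross-attention layer.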
The prediction target: ε, x₀, v, or flow
The architectures above all output something, and the choice of target changes training stability and few-step behaviour — but every option maps back to the same score.
| Parametrization | Network predicts | Notes | Source |
|---|---|---|---|
| ε-prediction | the added noise | DDPM default; L_simple = E‖ε − ε_θ‖²; s_θ = −ε_θ/√(1−ᾱₜ) | [7][36] |
| x₀-prediction | the clean signal | better behaved at high noise | [37] |
| v-prediction | velocity v = αₜ·ε − σₜ·x₀ | ~constant variance across t; standard for distillation/few-step | [30][37] |
| Flow-matching / rectified flow | drift v = X₁ − X₀ along straight paths | Xₜ = t·X₁ + (1−t)·X₀; equals v-pred up to weighting | [35] |
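The claim that these targets are interchangeable can be verified directly. The sketch below assumes a variance-preserving schedule (αₜ² + σₜ² = 1, with αₜ = √ᾱₜ and σₜ = √(1−ᾱₜ)) and shows that x₀, ε, and the score are all recoverable from v and xₜ. The function names and the value of `alpha_bar` are illustrative.

```python
import numpy as np

def to_v(x0, eps, alpha, sigma):
    """v-prediction target: v = alpha * eps - sigma * x0."""
    return alpha * eps - sigma * x0

def x0_from_v(x_t, v, alpha, sigma):
    """Recover the clean signal; valid when alpha^2 + sigma^2 = 1."""
    return alpha * x_t - sigma * v

def eps_from_v(x_t, v, alpha, sigma):
    """Recover the noise; valid when alpha^2 + sigma^2 = 1."""
    return sigma * x_t + alpha * v

def score_from_eps(eps, sigma):
    """Score of the Gaussian perturbation kernel: -eps / sigma."""
    return -eps / sigma

def flow_point(x_start, x_end, t):
    """Straight-line interpolant X_t = t * X_1 + (1 - t) * X_0."""
    return t * x_end + (1.0 - t) * x_start

def flow_target(x_start, x_end):
    """Constant drift along the straight path: X_1 - X_0."""
    return x_end - x_start

rng = np.random.default_rng(0)
x0, eps = rng.normal(size=4), rng.normal(size=4)
alpha_bar = 0.3                                   # assumed cumulative signal level
alpha, sigma = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)
x_t = alpha * x0 + sigma * eps
v = to_v(x0, eps, alpha, sigma)
print(np.allclose(x0_from_v(x_t, v, alpha, sigma), x0),
      np.allclose(eps_from_v(x_t, v, alpha, sigma), eps))
```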
Karras et al.’s EDM generalizes the lot with preconditioning: D_θ = c_skip·x + c_out·F_θ(c_in·x; c_noise), with coefficients chosen so the effective target has unit variance at every noise level (c_skip = σ_data²/(σ²+σ_data²), c_out = σ·σ_data/√(σ²+σ_data²), loss weight λ(σ) = 1/c_out²) [31][32]. Min-SNR then frames training as multi-task learning, clamping per-timestep weights wₜ = min(SNR(t), γ) with default γ=5, which gives 3.4× faster convergence and a then-record FID 2.06 [33][34]. Written as weights on the x₀ loss, a constant weight on the ε loss equals an SNR weight and a constant weight on the v loss equals SNR+1, so the clamped weight is divided by SNR for ε-prediction and by (SNR+1) for v-prediction [33][34]. The takeaway: parametrization is a variance-reduction and weighting choice, not a different model.
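The unit-variance property of the EDM coefficients can be checked arithmetically. The sketch below assumes clean data with variance σ_data² and independent noise with variance σ², and computes the variance of the quantity the raw network F_θ is asked to match. The default `sigma_data=0.5` is an assumed value, and `min_snr_weight` implements only the clamp stated above.

```python
import numpy as np

def edm_coeffs(sigma, sigma_data=0.5):
    """c_skip, c_out, and loss weight as given in the text; sigma_data is assumed."""
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / np.sqrt(total)
    loss_weight = 1.0 / c_out**2
    return c_skip, c_out, loss_weight

def effective_target_variance(sigma, sigma_data=0.5):
    """Variance of (x0 - c_skip * (x0 + n)) / c_out with Var[x0]=sigma_data^2, Var[n]=sigma^2."""
    c_skip, c_out, _ = edm_coeffs(sigma, sigma_data)
    return ((1.0 - c_skip) ** 2 * sigma_data**2 + c_skip**2 * sigma**2) / c_out**2

def min_snr_weight(snr, gamma=5.0):
    """Min-SNR clamp: w_t = min(SNR(t), gamma)."""
    return np.minimum(snr, gamma)

sigmas = np.array([0.01, 0.5, 10.0, 80.0])
print(effective_target_variance(sigmas))
print(min_snr_weight(np.array([0.1, 5.0, 100.0])))
```

The variance comes out as one at every noise level, so the network never has to produce outputs whose scale depends on σ.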
Conditioning the score network
A text-to-image score network must steer ∇log p(x) toward ∇log p(x | prompt). Three mechanisms, often combined:
- Cross-attention text injection. Latent Diffusion turns the denoiser into a conditional generator by feeding frozen text-encoder embeddings as the keys/values of cross-attention layers [12]. Imagen showed that generic frozen LLMs like T5 are surprisingly effective, and that scaling the text encoder boosts fidelity and alignment more than scaling the diffusion model itself [40]. SD3 concatenates three encoders — OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl [41].
- adaLN conditioning. In DiT/MMDiT the condition is regressed into per-block scale/shift instead of attended to — lower FID and cheaper than cross-attention [20].
- Guidance. Dhariwal & Nichol’s classifier guidance pushes samples with a noisy-image classifier’s gradient (sampling ∝ p(x)·p(y|x)^s) [10]. Ho & Salimans’ classifier-free guidance drops the classifier: jointly train a conditional and unconditional model and combine the two score estimates at sample time [38]. In diffusers this is exposed as guidance_scale (higher = follows prompt more closely, too high = artifacts) [39].
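The classifier-free guidance combination is a single line of arithmetic. The sketch below uses the convention in which a scale of 1 reproduces the conditional prediction and larger values extrapolate past it. The arrays are placeholders for two forward passes of the same network, and `cfg_combine` is a hypothetical helper name.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Guided estimate: uncond + scale * (cond - uncond)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, -1.0])   # placeholder: prediction with an empty prompt
eps_cond = np.array([0.5, 1.0, -2.0])     # placeholder: prediction with the text prompt
for scale in (0.0, 1.0, 7.5):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

Because ε and the score differ only by the factor −1/σ, the same linear combination applies to score estimates.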
Underneath it all, latent diffusion is the architectural enabler: the VAE offloads perceptual detail so the score network operates on a small latent grid, slashing compute versus pixel-space diffusion [12].
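A rough sense of the saving comes from counting elements, assuming the 8× spatial downsampling and 4 latent channels quoted earlier. Element count is only a proxy for compute, so the ratio below is indicative rather than a measured speedup.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Latent grid for an image of the given size under an assumed factor-8, 4-channel VAE."""
    return (height // factor, width // factor, channels)

def element_ratio(height, width, rgb_channels=3, factor=8, latent_channels=4):
    """How many times more values pixel space holds than the latent."""
    h, w, c = latent_shape(height, width, factor, latent_channels)
    return (height * width * rgb_channels) / (h * w * c)

print(latent_shape(256, 256), element_ratio(256, 256))
```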
State of the art and trade-offs (2026)
The defining story of 2024–2026 is the U-Net → DiT migration. U-Nets dominated 2021–2023 (SDXL at 2.57B params) but hit a scaling ceiling at roughly 2.6B; as data and compute grew, the bottleneck shifted from local fidelity to global semantic alignment, which favors attention [45]. The consensus rationale: attention strictly generalizes convolution and scales more smoothly, so at matched size U-Nets underperform DiTs [45][13].
| System | Backbone | Target | Text encoders | Source |
|---|---|---|---|---|
| Stable Diffusion 1.x | U-Net + cross-attn | ε | CLIP | [14] |
| SD3.5 | MMDiT + adaLN + QK-norm | rectified flow | CLIP-G, CLIP-L, T5-XXL | [43][44] |
| FLUX.1 (12B) | hybrid MMDiT + parallel DiT | rectified flow | T5-XXL, CLIP | [23][24] |
| Ideogram 4.0 (9.3B) | single-stream DiT, 34 layers | — | Qwen3-VL-8B | [49] |
| Midjourney | undisclosed diffusion | undisclosed | undisclosed | [48] |
⚠ Midjourney remains a closed box — a proprietary diffusion service tuned heavily for aesthetics, with no published architecture; any specifics about its internals are speculation [48]. FLUX.1 Krea (July 2025) is a useful public reference for where production is heading: a rectified-flow transformer explicitly targeting de-saturated photorealism over the over-saturated “AI look” [42].
On the efficiency axis, the action is distillation: knowledge, progressive, consistency, score, and adversarial distillation collapse sampling to 1–4 steps, where training-free samplers still need 10+ [46]. Latent Consistency Models predict the PF-ODE solution in latent space for 2–4 steps; SDXL-Turbo uses adversarial+score distillation for a single step; SDXL-Lightning uses progressive adversarial distillation [47]. The forward edge is unified multimodal / autoregressive generation — GPT Image, HunyuanImage, Transfusion — with hybrid AR+diffusion transformers blending sequence modeling and diffusion objectives [50].
Bottom line for the prompt engineer: the model that turns your prompt into pixels is a score network — a noise/velocity predictor wrapped in a U-Net or (increasingly) a transformer — whose guidance_scale knob literally scales how hard the combined conditional-minus-unconditional score pulls the sample toward your words [38][39].