diffusion models · forward process · reference sheet

Forward Diffusion // Noise Schedules

16 sources · survey depth · 2020–2024 literature
TL;DR
Use cosine for ≤256px. Use EDM σ or Laplace for new training runs: Laplace cuts ImageNet-256 FID by 26.6% (10.85 → 7.96) at no extra compute.[7] Always enforce zero terminal SNR, because no standard schedule enforces it exactly by default.[5] Scale the input, not the schedule, when resolution grows past 256px.[4]
| Property | Linear (Ho 2020) | Cosine (Nichol 2021) | Sigmoid (2022) | EDM σ (Karras 2022) | Laplace (Hang 2024) |
|---|---|---|---|---|---|
| SDE family | VP | VP | VP | VE | VP |
| Param axis | β_t linear | ᾱ_t cosine² | ᾱ_t sigmoid | σ lognormal | log-SNR Laplace |
| Train distribution | Uniform over T=1000 steps | Uniform over T steps | Uniform over T steps | LogNormal (μ=−1.2, σ=1.2) | Laplace, concentrated at log-SNR≈0 |
| Terminal SNR=0 | ✗ no | ✗ no | ✗ no | ≈ yes | ✗ no |
| ≤64px | too fast | good | fine | good | good |
| 256px | suboptimal | optimal | good | optimal | optimal |
| 1024px+ | under-noises | under-noises | stable | scale σ_max | robust |
| FID CIFAR-10 | ~3.2 (DDPM baseline)[1] | ~2.9 (Improved DDPM)[2] | - | 1.79 (35 NFE)[6] | see ImageNet-256 |
| FID ImageNet-256 | - | - | - | - | 7.96 (−26.6% vs. 10.85 baseline)[7] |
| Adoption | legacy | widespread | niche | SOTA baseline | emerging |
| Verdict | avoid | use ≤256px | high-res only | new training | ★ best FID |
⚠ Critical flaw: non-zero terminal SNR

No standard schedule enforces SNR(T) = 0. At the final timestep, the noised sample still carries residual signal. Inference starts from pure Gaussian noise — a condition the model was never trained on.[5][15]

  • Stable Diffusion cannot generate very bright or very dark images — locked to medium brightness
  • Training/inference mismatch degrades sample quality at all resolutions
Fixes (apply all four)
  1. Rescale β to enforce SNR(T) = 0 exactly
  2. Switch ε-prediction → v-prediction
  3. Always initiate sampler from the true final timestep
  4. Rescale classifier-free guidance to prevent overexposure in early steps
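Fix 1 above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `rescale_zero_terminal_snr` is hypothetical, and the linear endpoints (1e-4 to 0.02 over 1000 steps) are assumed only to have a concrete schedule to rescale. The idea is to shift √ᾱ_t so the last value is exactly 0, rescale so the first value is unchanged, then recover β_t. Fixes 2 to 4 (v-prediction, sampler start, guidance rescaling) are not covered here.

```python
import numpy as np

def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so that alpha_bar_T = 0, i.e. SNR(T) = 0."""
    alpha_bar = np.cumprod(1.0 - betas)
    sqrt_ab = np.sqrt(alpha_bar)
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value is exactly 0, then scale so the first value is kept.
    sqrt_ab = (sqrt_ab - last) * first / (first - last)
    alpha_bar = sqrt_ab ** 2
    # Recover per-step alphas from the cumulative product.
    alphas = np.concatenate([alpha_bar[:1], alpha_bar[1:] / alpha_bar[:-1]])
    return 1.0 - alphas

# Assumed example schedule: linear betas with DDPM-style endpoints.
betas = np.linspace(1e-4, 0.02, 1000)
fixed = rescale_zero_terminal_snr(betas)

terminal_before = np.cumprod(1.0 - betas)[-1]  # small but non-zero
terminal_after = np.cumprod(1.0 - fixed)[-1]   # exactly zero
```

After rescaling, the final β equals 1, so ε-prediction carries no information at the last step; this is why the fix is paired with v-prediction in the list above.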
Resolution decision guide
≤ 64px: cosine
≤ 256px: cosine (default)
256–512px: EDM σ or scale input
512–1024px: sigmoid or scale input ×b
new training run: Laplace / EDM
debugging quality: track log-SNR, not t
Key insight: importance sampling of log-SNR is mathematically equivalent to changing the schedule.[7] The critical region is log-SNR ≈ 0 — concentrate training mass there.
Forward marginal (any schedule)
q(x_t | x_0) = N(x_t ; √ᾱ_t · x_0, (1-ᾱ_t) I)

x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
      ε ~ N(0, I)
Closed form — sample any x_t from x_0 directly, without simulating intermediate steps.[1]
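As a rough sketch of the closed form above, the following NumPy snippet draws x_t directly from x_0 for a batch with one ᾱ_t value per sample. The function name `q_sample`, the toy tensor shape, and the example ᾱ values are all assumptions made for illustration.

```python
import numpy as np

def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in one step; returns (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3, 8, 8))  # toy stand-in for a batch of images
# One alpha_bar per sample, reshaped so it broadcasts over channels and pixels.
alpha_bar = np.array([0.99, 0.5, 0.1, 0.0]).reshape(-1, 1, 1, 1)
x_t, eps = q_sample(x0, alpha_bar, rng)
```

The last sample uses ᾱ_t = 0, so its x_t is pure noise; that is the zero-terminal-SNR condition the sampler assumes at inference.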
Cosine schedule (Nichol 2021)
f(t) = cos²( π/2 · (t/T + s) / (1+s) )
ᾱ_t  = f(t) / f(0)
β_t  = 1 - ᾱ_t / ᾱ_{t-1},  clip ≤ 0.999

s = 0.008  ← chosen so that √β_0 is slightly below the pixel bin size 1/127.5
Offset s prevents information loss from overlapping noise intervals.[9]
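The three formulas above translate fairly directly into NumPy. The sketch below assumes T = 1000 and uses a hypothetical function name, `cosine_beta_schedule`; it follows the definitions of f(t), ᾱ_t, and the 0.999 clip given in this section.

```python
import numpy as np

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas for the cosine schedule: alpha_bar_t = f(t) / f(0)."""
    t = np.arange(T + 1)
    f = np.cos(0.5 * np.pi * (t / T + s) / (1.0 + s)) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    # Clip to avoid a singularity at the last step.
    return np.clip(betas, 0.0, max_beta)

betas = cosine_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)
```

Because of the clip, the final ᾱ_T is tiny but not zero, which is the non-zero terminal SNR issue flagged earlier in this sheet.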
EDM σ-schedule (Karras 2022)
σ_i = ( σ_max^(1/ρ)
      + (i/(T-1)) · (σ_min^(1/ρ) - σ_max^(1/ρ)) )^ρ

σ_min=0.002  σ_max=80  ρ=7

Train: ln σ ~ N(P_mean, P_std²),  P_mean=-1.2  P_std=1.2
Decouples network preconditioning from the schedule. VE regime.[6][13]
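A minimal NumPy sketch of both pieces above: the ρ-spaced sampling grid and the log-normal training distribution. The function names are hypothetical, and the step count (35) is only an example value; the defaults are the constants listed in this section.

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Sampling grid: n noise levels from sigma_max down to sigma_min (n >= 2)."""
    ramp = np.linspace(0.0, 1.0, n)  # i / (n - 1)
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    return (max_inv + ramp * (min_inv - max_inv)) ** rho

def sample_training_sigmas(batch, rng, p_mean=-1.2, p_std=1.2):
    """Training noise levels: ln(sigma) ~ N(p_mean, p_std^2)."""
    return np.exp(p_mean + p_std * rng.standard_normal(batch))

sigmas = karras_sigmas(35)
rng = np.random.default_rng(0)
train_sigmas = sample_training_sigmas(100_000, rng)
```

With ρ = 7 the grid spends most of its steps at small σ, while the training distribution has its median at exp(−1.2) ≈ 0.3.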
Laplace log-SNR schedule (Hang 2024)
p(λ) = exp(-|λ - μ| / b) / (2b)
         λ = log-SNR

Cauchy variant:
p(λ) = γ / (π · ((λ-μ)² + γ²))
Sharp peak at log-SNR≈0. Cauchy has heavier tails for broader coverage.[7]
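One way to use these densities is to sample λ per training example and convert it to ᾱ. The sketch below assumes μ = 0 (matching the peak at log-SNR ≈ 0) and uses placeholder values b = 0.5 and γ = 1.0 that are not taken from the source. Since SNR = ᾱ/(1−ᾱ), the inverse map is ᾱ = sigmoid(λ). Function names are hypothetical.

```python
import numpy as np

def sample_laplace_logsnr(batch, rng, mu=0.0, b=0.5):
    """Draw log-SNR values from Laplace(mu, b); b = 0.5 is a placeholder."""
    return rng.laplace(loc=mu, scale=b, size=batch)

def sample_cauchy_logsnr(batch, rng, mu=0.0, gamma=1.0):
    """Heavier-tailed alternative; gamma = 1.0 is a placeholder."""
    return mu + gamma * rng.standard_cauchy(size=batch)

def logsnr_to_alpha_bar(logsnr):
    """Invert lambda = log(alpha_bar / (1 - alpha_bar)): a sigmoid."""
    return 1.0 / (1.0 + np.exp(-logsnr))

rng = np.random.default_rng(0)
lam = sample_laplace_logsnr(100_000, rng)
alpha_bar = logsnr_to_alpha_bar(lam)  # plug into the forward marginal
```

Sampling λ this way and sampling t uniformly under a matching schedule put the same training mass on each noise level, which is the equivalence noted in the key insight.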
| Symbol | Definition | Role |
|---|---|---|
| β_t | noise variance at step t | Schedule parameter: the quantity you actually design |
| ᾱ_t = ∏_{s≤t} α_s, with α_s = 1−β_s | cumulative signal weight | Governs q(x_t\|x_0); the forward marginal depends only on this[1] |
| SNR(t) | ᾱ_t / (1−ᾱ_t) | Natural timescale; decreases monotonically as t goes from 0 to T. Optimal schedule = SNR curve shape[4] |
| log-SNR(t) | log(ᾱ_t / (1−ᾱ_t)) | Training distribution axis; concentrate mass near 0 for best efficiency[7] |
| σ (EDM) | noise standard deviation | EDM works in σ-space directly; avoids ᾱ discretisation entirely[6] |
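The definitions above chain together as β_t → ᾱ_t → log-SNR(t). A short sketch, assuming a linear schedule with DDPM-style endpoints as the example input and a hypothetical helper name:

```python
import numpy as np

def logsnr_from_betas(betas):
    """log-SNR(t) = log(alpha_bar_t / (1 - alpha_bar_t)) for every step."""
    alpha_bar = np.cumprod(1.0 - betas)
    return np.log(alpha_bar) - np.log1p(-alpha_bar)

# Assumed example: linear betas from 1e-4 to 0.02 over T = 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
logsnr = logsnr_from_betas(betas)
terminal_snr = np.exp(logsnr[-1])  # small, but strictly positive
```

Plotting `logsnr` against t is a quick way to compare schedules on the axis that matters, and a finite final value confirms that the terminal SNR is not zero.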
Variance Preserving (VP-SDE): DDPM's continuous limit
dx = -½β(t) x dt + √β(t) dw

The drift term prevents variance from exploding. All ᾱ_t-based schedules belong to this family. Bounded variance throughout the forward process.[3]

Schedules in this family: Linear, Cosine, Sigmoid, Laplace
Variance Exploding (VE-SDE): SMLD's continuous limit
dx = √(dσ²(t)/dt) dw

No drift; variance grows to infinity. EDM's σ_max=80 approximates this regime — distribution at max noise is effectively Gaussian.[6][12]

Schedules in this family: EDM σ
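The two families can be compared on a common axis. Assuming unit-variance data (EDM itself uses σ_data = 0.5, which would rescale the SNR), the VE marginal x = x_0 + σ·ε has SNR = 1/σ², and dividing x by √(1+σ²) yields a VP-form marginal with ᾱ = 1/(1+σ²). The helper names below are hypothetical.

```python
import numpy as np

def sigma_to_logsnr(sigma):
    """VE noise level -> log-SNR, assuming unit-variance data: SNR = 1 / sigma^2."""
    return -2.0 * np.log(sigma)

def sigma_to_alpha_bar(sigma):
    """Equivalent VP signal weight after rescaling x by 1 / sqrt(1 + sigma^2)."""
    return 1.0 / (1.0 + sigma ** 2)

sigma = np.array([0.002, 1.0, 80.0])
logsnr = sigma_to_logsnr(sigma)
alpha_bar = sigma_to_alpha_bar(sigma)
# Both routes give the same log-SNR.
logsnr_via_vp = np.log(alpha_bar) - np.log1p(-alpha_bar)
```

At σ = 80 the equivalent ᾱ is 1/6401, which is why the distribution at maximum noise is effectively Gaussian even though it is never exactly zero-SNR.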

Both families admit a reverse-time SDE solvable with the score function ∇_x log p(x_t), which gives a unified generative mechanism.[3] Flow matching (SD3, FLUX) defines no explicit β schedule and instead learns a velocity field, but its choice of interpolation path and timestep sampling still implies a log-SNR distribution, so it is equivalent to a specific schedule choice.[16]

HuggingFace Diffusers: EDM Euler Scheduler — σ_min=0.002, σ_max=80, σ_data=0.5, ρ=7.[13]
EDM key properties
σ_min: 0.002
σ_max: 80.0
ρ: 7
P_mean (train): −1.2
P_std (train): 1.2
CIFAR-10 FID: 1.79 (35 NFE)
  1. Start with cosine for ≤256px tasks. Linear is strictly worse at identical compute: it wastes ~40% of training on near-trivial high-noise denoising.[2]
  2. Enforce zero terminal SNR on any new training run. It is a free correctness fix: rescale the β schedule, switch to v-prediction, and fix the sampler start.[5]
  3. Use EDM or Laplace when designing training from scratch. Laplace's 26.6% FID reduction on ImageNet-256 costs only a different schedule function, with the same architecture and compute.[7]
  4. Scale the input, not the schedule, when increasing resolution. The b-scaling trick shifts the log-SNR curve uniformly downward, a one-line change that is sufficient up to 1024×1024.[4]
  5. Track log-SNR, not t, when debugging. The nominal timestep is an arbitrary proxy; the quantity that actually controls learning difficulty is the signal-to-noise ratio.[7]
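Recommendation 4 (input scaling) can be sketched as follows. Multiplying x_0 by a factor b < 1 before noising turns SNR into b²·ᾱ/(1−ᾱ), so the whole log-SNR curve shifts by 2·log b. The function names and the value b = 0.5 are assumptions for illustration; any variance normalisation of x_t before it reaches the network is omitted.

```python
import numpy as np

def q_sample_scaled(x0, alpha_bar_t, b, rng):
    """Forward sample with the input scaled by b; the schedule is untouched."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * (b * x0) + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def shifted_logsnr(alpha_bar_t, b):
    """log-SNR after input scaling: the original curve plus 2 * log(b)."""
    return np.log(alpha_bar_t) - np.log1p(-alpha_bar_t) + 2.0 * np.log(b)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 3, 16, 16))  # toy stand-in for a batch of images
x_t, eps = q_sample_scaled(x0, alpha_bar_t=0.5, b=0.5, rng=rng)

base = shifted_logsnr(0.5, b=1.0)     # unscaled input: log-SNR is 0 at alpha_bar = 0.5
lowered = shifted_logsnr(0.5, b=0.5)  # shifted down by 2 * log(2)
```

Logging `shifted_logsnr` rather than t also serves recommendation 5, since it reports the effective noise level the network actually sees.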