diffusion models · forward process · reference sheet

Forward Diffusion // Noise Schedules

16 sources · survey depth · 2020–2024 literature

TL;DR

Use cosine for ≤256px. Use EDM σ or Laplace for new training runs: Laplace cuts ImageNet-256 FID by 26.6% (10.85 → 7.96) at the same architecture and compute.^[7] Always enforce zero terminal SNR; none of the standard schedules guarantees it by default.^[5] Scale the input, not the schedule, when resolution grows past 256px.^[4]

Schedule comparison matrix

Property	Linear (Ho 2020)	Cosine (Nichol 2021)	Sigmoid (2022)	EDM σ (Karras 2022)	Laplace (Hang 2024)
SDE family	VP	VP	VP	VE	VP
Param axis	`β_t` linear	`ᾱ_t` cosine²	`ᾱ_t` sigmoid	`σ` lognormal	log-SNR Laplace
Train distribution	Uniform over T=1000 steps	Uniform over T steps	Uniform over T steps	LogNormal (μ=−1.2, σ=1.2)	Laplace concentrated at log-SNR≈0
Terminal SNR=0	✗ no	✗ no	✗ no	≈ yes	✗ no
≤64px	too fast	good	fine	good	good
256px	suboptimal	optimal	good	optimal	optimal
1024px+	under-noises	under-noises	stable	scale σ_max	robust
FID CIFAR-10	~3.2 (DDPM baseline)^[1]	~2.9 (Improved DDPM)^[2]	—	1.79 (35 NFE)^[6]	see ImageNet-256
FID ImageNet-256	—	10.85 (baseline)^[7]	—	—	7.96 (−26.6%)^[7]
Adoption	legacy	widespread	niche	SOTA baseline	emerging
Verdict	avoid	use ≤256px	high-res only	new training	★ best FID

Critical flaw & decision guide

⚠ Critical flaw: non-zero terminal SNR

No standard schedule enforces SNR(T) = 0. At the final timestep, the noised sample still carries residual signal. Inference starts from pure Gaussian noise — a condition the model was never trained on.^[5]^[15]

Stable Diffusion cannot generate very bright or very dark images — locked to medium brightness
Training/inference mismatch degrades sample quality at all resolutions
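
The residual signal is easy to measure. A minimal sketch, assuming the DDPM linear β range of 1e-4 to 0.02 over T=1000 steps (values from Ho 2020, not stated in this sheet); the function names are illustrative.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; the endpoints are the assumed DDPM defaults."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

abar_T = alpha_bar(linear_betas())[-1]
terminal_snr = abar_T / (1.0 - abar_T)

# Zero terminal SNR would need abar_T == 0; the linear schedule leaves a small residue.
print(f"abar_T = {abar_T:.3e}  SNR(T) = {terminal_snr:.3e}  sqrt(abar_T) = {math.sqrt(abar_T):.4f}")
```

The weight on x_0 at the last step is √ᾱ_T, which is small but not zero, so the model sees a faint copy of the image at every training step that inference never provides.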

Fixes (apply all four)

Rescale β to enforce SNR(T) = 0 exactly
Switch ε-prediction → v-prediction
Always initiate sampler from the true final timestep
Rescale classifier-free guidance to prevent overexposure in early steps
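
The first two fixes can be sketched in a few lines. This is an illustration of the shift-and-scale rescaling described in ^[5], not a reference implementation; the linear β range is the assumed DDPM default and the helper names are hypothetical.

```python
import math

def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(abar_t) so the last value is exactly 0 and the first is unchanged."""
    abar_sqrt, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar_sqrt.append(math.sqrt(prod))
    first, last = abar_sqrt[0], abar_sqrt[-1]
    scaled = [(a - last) * first / (first - last) for a in abar_sqrt]
    abar = [a * a for a in scaled]
    # Recover per-step alphas from consecutive ratios, then convert back to betas.
    alphas = [abar[0]] + [abar[t] / abar[t - 1] for t in range(1, len(abar))]
    return [1.0 - a for a in alphas]

def v_target(x0, eps, abar_t):
    """v-prediction target: v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0."""
    return math.sqrt(abar_t) * eps - math.sqrt(1.0 - abar_t) * x0

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed DDPM linear range
new_betas = rescale_zero_terminal_snr(betas)

abar_T = 1.0
for b in new_betas:
    abar_T *= 1.0 - b
print(new_betas[-1], abar_T)  # the last beta becomes 1, so abar_T is 0
```

With ᾱ_T = 0 the input at the last step is pure noise, so an ε target would just repeat the input. The v target reduces to −x_0 there, which is why the rescaling is paired with v-prediction.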

Resolution decision guide

≤ 64px → cosine

≤ 256px → cosine (default)

256–512px → EDM σ or scale input

512–1024px → sigmoid or scale input ×b

new training run → laplace / EDM

debugging quality → track log-SNR, not t

Key insight: importance sampling of log-SNR is mathematically equivalent to changing the schedule.^[7] The critical region is log-SNR ≈ 0 — concentrate training mass there.

Key formulas

Forward marginal (any schedule)

q(x_t | x_0) = N(x_t ; √ᾱ_t · x_0, (1-ᾱ_t) I)

x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
      ε ~ N(0, I)

Closed form — sample any x_t from x_0 directly, without simulating intermediate steps.^[1]
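
A minimal sketch of the one-step sampler, using a flat list of floats in place of an image tensor. The name q_sample and the toy values are illustrative.

```python
import math
import random

def q_sample(x0, abar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in one step; returns the sample and the noise used."""
    signal = math.sqrt(abar_t)
    noise = math.sqrt(1.0 - abar_t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25]
x_clean, _ = q_sample(x0, 1.0, rng)    # abar = 1: the data passes through unchanged
x_noise, eps = q_sample(x0, 0.0, rng)  # abar = 0: the output is the noise itself
```

Returning ε alongside x_t is a convenience for training loops, where the same draw serves as the regression target.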

Cosine schedule (Nichol 2021)

f(t) = cos²( π/2 · (t/T + s) / (1+s) )
ᾱ_t  = f(t) / f(0)
β_t  = 1 - ᾱ_t / ᾱ_{t-1},  clip ≤ 0.999

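s = 0.008  ← chosen so that √β_0 is slightly below the pixel bin size 1/127.5

The offset s keeps β_t from becoming vanishingly small near t = 0, where almost no noise is added and ε is hard to predict accurately.^[9]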
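
The three formulas above translate directly into code. A sketch with T=1000 as an assumed step count; function names are illustrative.

```python
import math

def cosine_alpha_bar(T=1000, s=0.008):
    """abar_t = f(t) / f(0) for t = 0..T, with f(t) = cos^2(pi/2 * (t/T + s) / (1 + s))."""
    def f(t):
        return math.cos(math.pi / 2 * (t / T + s) / (1 + s)) ** 2
    f0 = f(0)
    return [f(t) / f0 for t in range(T + 1)]

def betas_from_alpha_bar(abar, max_beta=0.999):
    """beta_t = 1 - abar_t / abar_{t-1}, clipped from above."""
    return [min(1.0 - abar[t] / abar[t - 1], max_beta) for t in range(1, len(abar))]

abar = cosine_alpha_bar()
betas = betas_from_alpha_bar(abar)
print(abar[0], abar[500], betas[0], betas[-1])
```

The clip only binds at the very end of the schedule, where ᾱ_t / ᾱ_{t-1} collapses toward zero. Because the last β is capped below 1, the discrete schedule built from these betas still ends with a small non-zero ᾱ_T.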

EDM σ-schedule (Karras 2022)

σ_i = ( σ_max^(1/ρ)
      + (i/(T-1)) · (σ_min^(1/ρ) - σ_max^(1/ρ)) )^ρ

σ_min=0.002  σ_max=80  ρ=7

Train: ln σ ~ N(P_mean, P_std²),  P_mean=-1.2  P_std=1.2

Decouples network preconditioning from schedule. VE regime.^[6] ^[13]
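
A sketch of both halves of the EDM recipe: the deterministic sampling grid and the random training noise level. The step count of 18 is an assumption chosen for illustration; function names are hypothetical.

```python
import math
import random

def edm_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Sampling noise levels sigma_0 > ... > sigma_{n-1}, interpolated linearly in sigma^(1/rho)."""
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

def sample_training_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Training noise level: ln(sigma) ~ Normal(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))

sigmas = edm_sigmas(18)
rng = random.Random(0)
train_sigmas = [sample_training_sigma(rng) for _ in range(5)]
print(sigmas[0], sigmas[-1])
```

With ρ=7 the grid is much denser near σ_min than near σ_max, so most sampling steps are spent at low noise. The training distribution is independent of this grid, which is the decoupling the sheet refers to.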

Laplace log-SNR schedule (Hang 2024)

p(λ) = exp(-|λ - μ| / b) / (2b)
         λ = log-SNR

Cauchy variant:
p(λ) = γ / (π · ((λ-μ)² + γ²))

Sharp peak at log-SNR≈0. Cauchy has heavier tails for broader coverage.^[7]
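
Both densities can be sampled by inverse CDF and mapped back to ᾱ through the logistic function. A sketch only: μ=0, b=0.5 and γ=1 are illustrative values, not settings taken from ^[7].

```python
import math
import random

def laplace_log_snr(u, mu=0.0, b=0.5):
    """Inverse CDF of Laplace(mu, b): maps u in (0, 1) to a log-SNR value."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against log(0) at the endpoints
    d = u - 0.5
    return mu - b * math.copysign(1.0, d) * math.log(1.0 - 2.0 * abs(d))

def cauchy_log_snr(u, mu=0.0, gamma=1.0):
    """Inverse CDF of Cauchy(mu, gamma)."""
    return mu + gamma * math.tan(math.pi * (u - 0.5))

def alpha_bar_from_log_snr(lam):
    """Invert lam = log(abar / (1 - abar)); this is the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-lam))

rng = random.Random(0)
lams = [laplace_log_snr(rng.random()) for _ in range(10000)]
near_zero = sum(abs(lam) < 1.0 for lam in lams) / len(lams)
print(near_zero)  # fraction of training draws with |log-SNR| < 1
```

Each sampled λ fixes ᾱ, so a training step draws λ, converts it, and noises x_0 with the usual forward marginal.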

Key quantities

Symbol	Definition	Role
β_t	noise variance at step t	Schedule parameter — the thing you actually design
ᾱ_t = ∏_{s≤t} α_s, α_s = 1−β_s	cumulative signal weight	Governs q(x_t | x_0); the forward marginal depends only on this^[1]
SNR(t)	ᾱ_t / (1−ᾱ_t)	Natural timescale; decreases monotonically as t goes from 0 to T. Choosing a schedule amounts to choosing the shape of this curve^[4]
log-SNR(t)	log(ᾱ_t / (1−ᾱ_t))	Training distribution axis; concentrate mass near 0 for best efficiency^[7]
σ (EDM)	noise standard deviation	EDM works in σ-space directly; avoids ᾱ discretisation entirely^[6]
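
The quantities in the table are interchangeable. A small conversion sketch; the ᾱ ↔ σ mapping assumes the standard identification σ² = (1−ᾱ)/ᾱ obtained by dividing x_t by √ᾱ_t, which this sheet does not state explicitly.

```python
import math

def snr(abar):
    """SNR = abar / (1 - abar)."""
    return abar / (1.0 - abar)

def log_snr(abar):
    """log-SNR, written with log1p for accuracy when abar is small."""
    return math.log(abar) - math.log1p(-abar)

def sigma_from_alpha_bar(abar):
    """Equivalent noise level after rescaling x_t by 1/sqrt(abar): sigma^2 = 1 / SNR."""
    return math.sqrt((1.0 - abar) / abar)

def alpha_bar_from_sigma(sigma):
    """Inverse of sigma_from_alpha_bar."""
    return 1.0 / (1.0 + sigma * sigma)

for abar in (0.99, 0.5, 0.01):
    print(abar, snr(abar), log_snr(abar), sigma_from_alpha_bar(abar))
```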

SDE classification: VP vs VE

Variance Preserving (VP-SDE): DDPM's continuous limit

dx = -½β(t) x dt + √β(t) dw

The drift term prevents variance from exploding. All ᾱ_t-based schedules belong to this family. Bounded variance throughout the forward process.^[3]

Schedules: Linear, Cosine, Sigmoid, Laplace

Variance Exploding (VE-SDE): SMLD's continuous limit

dx = √(dσ²(t)/dt) dw

No drift; variance grows to infinity. EDM's σ_max=80 approximates this regime — distribution at max noise is effectively Gaussian.^[6]^[12]

Schedules: EDM σ

Both families admit a reverse-time SDE that can be solved given the score function ∇_x log p(x_t), so one generative mechanism covers both.^[3] Flow matching (SD3, FLUX) learns a velocity field instead of a hand-designed β schedule. Its default of sampling t uniformly still induces a particular log-SNR distribution, so it is equivalent to a specific schedule choice.^[16]
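
The difference between the VP and VE families shows up in how the variance of a scalar x evolves. A sketch under simplifying assumptions: constant β(t)=1 for VP and σ²(t)=t² for VE are toy choices made only for illustration.

```python
def vp_variance(var0, beta, t_end, steps=10000):
    """Euler integration of dVar/dt = beta * (1 - Var), the variance ODE implied by the VP-SDE."""
    dt = t_end / steps
    var = var0
    for _ in range(steps):
        var += beta * (1.0 - var) * dt
    return var

def ve_variance(var0, sigma_sq, t_end):
    """VE-SDE variance in closed form: Var(t) = Var(0) + sigma^2(t) - sigma^2(0)."""
    return var0 + sigma_sq(t_end) - sigma_sq(0.0)

times = (1.0, 5.0, 10.0)
vp = [vp_variance(4.0, beta=1.0, t_end=t) for t in times]
ve = [ve_variance(4.0, lambda t: t * t, t_end=t) for t in times]
print(vp)  # relaxes toward 1 regardless of the starting variance
print(ve)  # grows without bound as sigma(t) grows
```

The VP drift pulls the variance to 1 from any starting value, while the VE process only ever adds variance.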

EDM Euler scheduler — parameter surface


HuggingFace Diffusers: EDM Euler Scheduler — σ_min=0.002, σ_max=80, σ_data=0.5, ρ=7.^[13]

EDM key properties

σ_min → 0.002

σ_max → 80.0

ρ → 7

P_mean (train) → −1.2

P_std (train) → 1.2

CIFAR-10 FID → 1.79 (35 NFE)
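
The σ_data parameter enters through the network preconditioning. A sketch of the scaling coefficients as given in the EDM paper (Karras 2022), where the denoiser is D(x; σ) = c_skip·x + c_out·F(c_in·x; c_noise); this is not the Diffusers API, and the function name is hypothetical.

```python
import math

def edm_preconditioning(sigma, sigma_data=0.5):
    """EDM scaling coefficients for noise level sigma."""
    total_var = sigma ** 2 + sigma_data ** 2
    return {
        "c_skip": sigma_data ** 2 / total_var,                 # weight of the skip connection
        "c_out": sigma * sigma_data / math.sqrt(total_var),    # scale of the network output
        "c_in": 1.0 / math.sqrt(total_var),                    # scale of the network input
        "c_noise": 0.25 * math.log(sigma),                     # noise-level conditioning value
    }

for sigma in (0.002, 0.5, 80.0):
    print(sigma, edm_preconditioning(sigma))
```

At σ_min the skip weight is close to 1, so the network only predicts a small correction; at σ_max it is close to 0, so the network predicts the clean signal almost entirely on its own.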

Practical guidance

Start with cosine for ≤256px tasks. Linear is strictly worse at identical compute — it wastes ~40% of training on near-trivial high-noise denoising.^[2]

Enforce zero terminal SNR on any new training run. It is a free correctness fix: rescale the β schedule, switch to v-prediction, start the sampler from the final timestep, and rescale classifier-free guidance at inference.^[5]

Use EDM or Laplace when designing training from scratch. Laplace's 26.6% FID reduction on ImageNet-256 costs only a different schedule function, with the same architecture and compute.^[7]

Scale the input, not the schedule, when increasing resolution. Multiplying x_0 by a constant b < 1 shifts the log-SNR curve down by the same amount (2·|log b|) at every timestep, a one-line change sufficient up to 1024×1024.^[4]
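
The uniform shift follows from the forward marginal: replacing x_0 with b·x_0 multiplies the SNR by b². A sketch assuming unit-variance data; b=0.5 is an illustrative value.

```python
import math

def log_snr_scaled(abar, b=1.0):
    """log-SNR of x_t = sqrt(abar) * b * x0 + sqrt(1 - abar) * eps, for unit-variance x0."""
    return math.log(b * b * abar / (1.0 - abar))

b = 0.5
shifts = [log_snr_scaled(a, b) - log_snr_scaled(a) for a in (0.9, 0.5, 0.1)]
print(shifts, 2.0 * math.log(b))  # every shift equals 2 * ln(b), independent of abar
```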

Track log-SNR, not t, when debugging. The nominal timestep is an arbitrary proxy; the quantity that actually controls learning difficulty is the signal-to-noise ratio.^[7]
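
To see why t is a poor proxy, compare where two schedules cross log-SNR = 0. A sketch assuming T=1000 and the DDPM linear β range of 1e-4 to 0.02; helper names are illustrative.

```python
import math

T = 1000

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """abar_t for a linear beta schedule (assumed DDPM endpoints)."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_start + i * (beta_end - beta_start) / (T - 1))
        out.append(prod)
    return out

def cosine_alpha_bar(T, s=0.008):
    """abar_t for the cosine schedule, t = 1..T."""
    def f(t):
        return math.cos(math.pi / 2 * (t / T + s) / (1 + s)) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def first_step_below(abar, threshold=0.0):
    """Index of the first step whose log-SNR drops below the threshold."""
    for t, a in enumerate(abar):
        if math.log(a / (1.0 - a)) < threshold:
            return t
    return None

lin_cross = first_step_below(linear_alpha_bar(T))
cos_cross = first_step_below(cosine_alpha_bar(T))
print(lin_cross, cos_cross)
```

The linear schedule reaches log-SNR = 0 far earlier than the cosine schedule, so the same value of t corresponds to very different noise levels under the two. Logging loss against log-SNR makes runs with different schedules comparable.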

Sources (16)