Forward Diffusion and Noise Schedules

TL;DR: Use cosine for ≤256px generation; switch to EDM’s σ-schedule (Karras et al., NeurIPS 2022) for continuous-time high-quality training. Enforce zero terminal SNR: every common off-the-shelf schedule violates it [5]. Scale the schedule (or just the input) toward more noise as resolution grows: the same cosine schedule that works at 64px will under-noise a 512px image [4].

The Forward Process

DDPM (Ho et al., 2020) defines the forward process as a fixed Markov chain that progressively destroys data by adding Gaussian noise at each step [1]:

q(x_t | x_{t-1}) = N(x_t ; sqrt(1-β_t) x_{t-1}, β_t I)

Because consecutive Gaussian steps compose analytically, the process has a closed-form marginal — you can corrupt a clean image x₀ directly to any timestep t without simulating intermediate steps [1]:

q(x_t | x_0) = N(x_t ; sqrt(ᾱ_t) x_0, (1-ᾱ_t) I)
x_t = sqrt(ᾱ_t) · x_0 + sqrt(1-ᾱ_t) · ε,   ε ~ N(0, I)

where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s. This reparameterized sampling form is the core of training: at each gradient step the model samples a random t, corrupts x_0 into x_t using the closed form, and learns to predict ε.
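The closed-form corruption step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `q_sample`, the stand-in data, and the 1-based timestep convention (so `alpha_bar[t-1]` holds ᾱ_t) are assumptions made for the example.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot using the closed-form marginal.

    x0:        clean data, shape (batch, ...)
    t:         integer timesteps in [1, T], shape (batch,)
    alpha_bar: cumulative products; alpha_bar[t-1] is the value for step t
    """
    eps = rng.standard_normal(x0.shape)
    # Broadcast one coefficient per sample over the remaining data axes.
    a = alpha_bar[t - 1].reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

# Illustrative setup: any valid beta sequence works here.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 3, 8, 8))  # stand-in "images" in [-1, 1]
t = rng.integers(1, T + 1, size=4)              # one random timestep per sample
x_t, eps = q_sample(x0, t, alpha_bar, rng)      # eps is the regression target
```

In a training loop the network would receive `x_t` and `t` and be trained to regress `eps`.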

Key Derived Quantities

Everything about the forward process reduces to how ᾱ_t evolves over time:

Symbol	Definition	Role
β_t	noise variance at step t	schedule parameter
α_t = 1−β_t	per-step signal retention	intermediate
ᾱ_t = ∏ α_s	cumulative signal weight	governs q(x_t | x_0)
SNR(t) = ᾱ_t/(1−ᾱ_t)	signal-to-noise ratio	natural timescale [4]
log-SNR(t)	log of SNR(t)	key training distribution axis [7]

The SNR decreases monotonically from high (near-clean data at t=0) to near-zero (near-pure noise at t=T). Choosing a noise schedule is exactly choosing how fast this descent happens.

Schedule Families

Linear (Ho 2020)

The original DDPM linearly interpolates β_t from β₁=10⁻⁴ to β_T=0.02 over T=1000 steps [1]:

β_t = β_1 + ((t-1)/(T-1)) · (β_T - β_1)

Simple and cheap, but aggressive: the ᾱ_t curve drops steeply, making images nearly indistinguishable from pure noise by roughly t=600. This wastes ~40% of training capacity on near-trivial denoising and produces non-zero terminal SNR [5]. For low-resolution images (32px), the linear schedule pushes data to pure noise too quickly [10].
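As a rough check of these claims, the linear schedule and the derived quantities from the table above can be computed directly. The helper name `linear_beta_schedule` is an assumption for this sketch; the constants are the ones quoted in the text.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    # beta_t = beta_1 + (t-1)/(T-1) * (beta_T - beta_1), for t = 1..T
    t = np.arange(1, T + 1)
    return beta_1 + (t - 1) / (T - 1) * (beta_T - beta_1)

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal weight
snr = alpha_bar / (1.0 - alpha_bar)   # signal-to-noise ratio
log_snr = np.log(snr)

# Inspect a few timesteps, including the terminal one.
for step in (1, 250, 600, 1000):
    i = step - 1
    print(f"t={step:4d}  alpha_bar={alpha_bar[i]:.6f}  log-SNR={log_snr[i]:+.2f}")
```

The terminal value `alpha_bar[-1]` is small but strictly positive, which is the non-zero terminal SNR referred to above.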

Cosine (Nichol & Dhariwal, 2021)

Introduced in Improved DDPM to fix aggressive early noise. ᾱ_t is defined directly as a squared cosine [2] [9]:

f(t) = cos²( π/2 · (t/T + s) / (1 + s) )
ᾱ_t  = f(t) / f(0)
β_t  = 1 - ᾱ_t / ᾱ_{t-1},  clipped ≤ 0.999

Offset s=0.008 prevents β_t from becoming too small near t=0, where tiny amounts of noise are hard for the network to predict accurately; it is chosen so that √β at the first step sits just below the 1/127.5 pixel bin size [9]. The result: noise is added slowly at the start (letting the model learn coarse structure first), accelerates through the middle, and tapers near t=T, producing a more consistent gradient signal across timesteps [2].
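The three formulas above translate almost line for line into NumPy. The function names are assumptions for this sketch; the constants s=0.008 and the 0.999 clip come from the text.

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    # f(t) = cos^2(pi/2 * (t/T + s) / (1 + s)), evaluated at t = 0..T
    t = np.arange(T + 1)
    f = np.cos(0.5 * np.pi * (t / T + s) / (1.0 + s)) ** 2
    return f / f[0]  # normalised so that alpha_bar at t=0 is exactly 1

def cosine_beta_schedule(T=1000, s=0.008, max_beta=0.999):
    alpha_bar = cosine_alpha_bar(T, s)
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, for t = 1..T
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

betas = cosine_beta_schedule()
# Recompute alpha_bar from the clipped betas, as a sampler would see it.
alpha_bar_clipped = np.cumprod(1.0 - betas)
print("sqrt(beta_1)         =", np.sqrt(betas[0]))
print("pixel bin size       =", 1 / 127.5)
print("terminal sqrt(a_bar) =", np.sqrt(alpha_bar_clipped[-1]))
```

Note that the clip at 0.999 is what leaves a small residual signal at t=T: the unclipped ᾱ_T is essentially zero, but the clipped schedule is not.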

Sigmoid

For high-resolution images, the sigmoid schedule applies smoother transitions [8] [11]. Higher-resolution images contain greater pixel-level redundancy — neighbouring pixels carry correlated information — making the denoising task inherently easier at identical noise levels. The sigmoid schedule’s gradual S-shaped curve avoids abrupt mid-process transitions that destabilise high-res training. As resolution grows, the sigmoid outperforms cosine on stability [8].
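The text does not give a formula for the sigmoid schedule, so the following is only a sketch of one common parameterisation over continuous time t in [0, 1]. The function name and the default values `start=-3`, `end=3`, `tau=1` are assumptions, not settings taken from the sources cited here.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_alpha_bar(t, start=-3.0, end=3.0, tau=1.0, clip_min=1e-9):
    """Signal weight alpha_bar(t) for t in [0, 1], shaped like a flipped sigmoid.

    start, end and tau are assumed defaults; tau controls how sharp the
    transition is, and all three would normally be tuned per resolution.
    """
    t = np.asarray(t, dtype=float)
    v_start = _sigmoid(start / tau)
    v_end = _sigmoid(end / tau)
    out = _sigmoid((t * (end - start) + start) / tau)
    # Rescale so that alpha_bar(0) = 1 and alpha_bar(1) = 0 before clipping.
    out = (v_end - out) / (v_end - v_start)
    return np.clip(out, clip_min, 1.0)

t = np.linspace(0.0, 1.0, 11)
print(np.round(sigmoid_alpha_bar(t), 4))
```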

EDM σ-schedule (Karras et al., NeurIPS 2022)

Elucidating the Design Space of Diffusion-Based Generative Models (EDM) abandons the ᾱ_t parameterisation entirely, working directly in noise standard deviation σ [6]. This decouples network preconditioning from the schedule.

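The inference schedule interpolates between σ_max and σ_min with a ρ-polynomial [13]. Here i = 0, …, T−1 indexes sampling steps, and T denotes the number of sampling steps rather than the 1000 training steps used earlier: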

σ_i = (σ_max^(1/ρ) + (i/(T-1)) · (σ_min^(1/ρ) - σ_max^(1/ρ)))^ρ

Default parameters: σ_min=0.002, σ_max=80, ρ=7. For training, noise levels are drawn from a lognormal distribution P_train(σ) = LogNormal(P_mean=−1.2, P_std=1.2) rather than uniformly over discrete steps — concentrating learning on the informationally critical mid-noise region. EDM achieves FID 1.79 on CIFAR-10 (class-conditional) with 35 network evaluations [6].
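Both pieces, the ρ-polynomial inference grid and the lognormal training distribution, are short enough to sketch directly. The function names are assumptions for this example; the default values are the ones quoted above.

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels sigma_0 > ... > sigma_{n-1} from the rho-polynomial rule."""
    ramp = np.linspace(0.0, 1.0, n)  # i / (n - 1)
    max_inv_rho = sigma_max ** (1.0 / rho)
    min_inv_rho = sigma_min ** (1.0 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

def sample_training_sigmas(batch, p_mean=-1.2, p_std=1.2, rng=None):
    """Draw sigma with ln(sigma) ~ Normal(p_mean, p_std^2)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.exp(p_mean + p_std * rng.standard_normal(batch))

sigmas = karras_sigmas(18)
print("first / last sigma:", sigmas[0], sigmas[-1])

train_sigmas = sample_training_sigmas(8, rng=np.random.default_rng(0))
print("training sigmas:", np.round(train_sigmas, 3))
```

With ρ=7 the grid is much denser near σ_min than near σ_max, so most sampling steps are spent at low noise levels.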

Laplace/Cauchy (Hang et al., 2024)

Improved Noise Schedule for Diffusion Training formalises the principle behind EDM’s training distribution [7]. Key insight: importance sampling of log-SNR is mathematically equivalent to changing the schedule. The critical region is log-SNR ≈ 0 — where signal and noise energy are balanced — and concentrating training there maximises learning efficiency.

Schedules derived from target log-SNR distributions:

Laplace: p(λ) = exp(−|λ−μ|/b) / (2b), a sharp peak at the chosen log-SNR μ
Cauchy: p(λ) = (1/π) · γ/((λ−μ)²+γ²), heavier tails for broader coverage

On ImageNet-256 with 500K iterations, Laplace achieves FID 7.96 vs cosine baseline 10.85 — 26.6% improvement — at identical compute and architecture [7].
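To make the idea of a schedule derived from a target log-SNR distribution concrete, the sketch below draws λ from a Laplace density by inverse-CDF sampling and converts it to ᾱ through SNR = ᾱ/(1−ᾱ). The function names and the values μ=0, b=0.5 are illustrative assumptions, not the settings reported in [7].

```python
import numpy as np

def sample_laplace_log_snr(n, mu=0.0, b=0.5, rng=None):
    """Draw log-SNR values lambda ~ Laplace(mu, b) via the inverse CDF."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(-0.5, 0.5, size=n)
    return mu - b * np.sign(u) * np.log1p(-2.0 * np.abs(u))

def log_snr_to_alpha_bar(lam):
    # SNR = alpha_bar / (1 - alpha_bar)  =>  alpha_bar = sigmoid(log-SNR)
    return 1.0 / (1.0 + np.exp(-np.asarray(lam, dtype=float)))

rng = np.random.default_rng(0)
lam = sample_laplace_log_snr(5, rng=rng)
alpha_bar = log_snr_to_alpha_bar(lam)
# Each (sqrt(alpha_bar), sqrt(1 - alpha_bar)) pair plugs into the usual
# x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps corruption.
print(np.round(lam, 3), np.round(alpha_bar, 3))
```

Sampling λ this way and sampling t uniformly under the corresponding schedule are two views of the same training distribution, which is the equivalence the paragraph above describes.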

Schedule Comparison

Schedule	Year	Key property	Known limitation
Linear	2020	Simple; baseline	Aggressive; non-zero terminal SNR; bad at low-res
Cosine	2021	Smooth; consistent signal across steps	Non-zero terminal SNR; under-noises at high-res
Sigmoid	2022	High-res stability; gradual transitions	Less adoption; application-specific tuning
EDM σ	2022	Principled σ-parameterisation; SOTA FID	Different parameterisation from ᾱ-based models
Laplace	2024	Theoretically motivated; best FID	Not yet default in major frameworks

Resolution and Scale Dependence

Chen (2023) [4] [14] demonstrates that the optimal schedule shifts toward noisier levels as resolution increases. At 256×256 the cosine schedule works well; at 1024×1024 the same schedule systematically under-noises images, producing an undertrained high-noise regime.

Practical fix: rather than redesigning the schedule per resolution, scale the input data by a factor b < 1 and keep the schedule fixed. Because the signal amplitude is multiplied by b, the SNR is multiplied by b², so the entire log-SNR curve shifts down by the constant 2·log(1/b) (more noise relative to signal). This single-parameter adjustment is sufficient for single-stage generation up to 1024×1024 [4].
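The effect of input scaling on log-SNR is simple arithmetic, sketched below under the assumption of unit-variance data. The helper names and the example value b=0.5 are hypothetical; suitable values of b depend on resolution and are not specified here.

```python
import numpy as np

def log_snr_with_input_scale(alpha_bar, b=1.0):
    """log-SNR of x_t = sqrt(a) * (b * x0) + sqrt(1 - a) * eps, unit-variance x0."""
    alpha_bar = np.asarray(alpha_bar, dtype=float)
    return np.log(alpha_bar / (1.0 - alpha_bar)) + 2.0 * np.log(b)

def q_sample_scaled(x0, alpha_bar_t, b, rng):
    # Same closed-form corruption as before, applied to the scaled input.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * (b * x0) + np.sqrt(1.0 - alpha_bar_t) * eps

alpha_bar = np.array([0.9, 0.5, 0.1])
base = log_snr_with_input_scale(alpha_bar, b=1.0)
scaled = log_snr_with_input_scale(alpha_bar, b=0.5)
print("shift:", scaled - base)  # the same constant at every noise level
```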

The Terminal SNR Flaw

Lin et al. (WACV 2024) [5] identify a critical bug: no standard off-the-shelf schedule achieves SNR(T) = 0. At the final timestep t=T, the noised sample still contains residual signal. But at inference, the sampler starts from pure Gaussian noise — a condition the model was never trained on.

Consequences:

⚠ Stable Diffusion cannot generate very bright or very dark images — restricted to medium brightness [5]
⚠ Training/inference mismatch degrades sample quality across all resolutions [15]

Fixes [5]:

Rescale β schedule to enforce SNR(T) = 0 exactly
Switch from ε-prediction to v-prediction
Ensure sampler always initiates from the true final timestep
Rescale classifier-free guidance to prevent overexposure in early steps
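The first two fixes can be sketched as follows. This is a NumPy rendering of the shift-and-rescale idea described in [5], written from memory rather than copied from a reference implementation; the function names are assumptions.

```python
import numpy as np

def rescale_zero_terminal_snr(betas):
    """Return betas whose cumulative alpha_bar reaches exactly 0 at t = T."""
    alpha_bar_sqrt = np.sqrt(np.cumprod(1.0 - betas))
    first, last = alpha_bar_sqrt[0], alpha_bar_sqrt[-1]
    # Shift so the last value is zero, then rescale so the first is unchanged.
    alpha_bar_sqrt = (alpha_bar_sqrt - last) * first / (first - last)
    alpha_bar = alpha_bar_sqrt ** 2
    # Convert the cumulative products back into per-step alphas and betas.
    alphas = np.concatenate([alpha_bar[:1], alpha_bar[1:] / alpha_bar[:-1]])
    return 1.0 - alphas

def v_prediction_target(x0, eps, alpha_bar_t):
    # v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0

betas = np.linspace(1e-4, 0.02, 1000)
new_betas = rescale_zero_terminal_snr(betas)
print("terminal alpha_bar before:", np.cumprod(1.0 - betas)[-1])
print("terminal alpha_bar after: ", np.cumprod(1.0 - new_betas)[-1])
```

Once ᾱ_T is exactly zero, x_T equals ε, so predicting ε at t=T carries no information about the data. The v target at that step reduces to −x_0, which is why the schedule fix is paired with v-prediction.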

VE vs VP: The SDE View

Song et al. (2021) [3] unified DDPM and SMLD as discretisations of two continuous-time SDE families [12]:

Variance Preserving (VP-SDE) — DDPM’s continuous limit:

dx = -½β(t) x dt + sqrt(β(t)) dw

The drift term prevents variance from exploding; all ᾱ_t-based schedules (linear, cosine, sigmoid) belong to this family.

Variance Exploding (VE-SDE) — SMLD’s continuous limit:

dx = sqrt(dσ²(t)/dt) dw

No drift; variance grows to infinity. EDM’s σ_max=80 approximates this regime — the distribution at maximum noise is effectively Gaussian [6].

Both families admit a reverse-time SDE solvable with the score function ∇_x log p(x_t), providing a unified generative mechanism [3].
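For reference, the general reverse-time SDE and the closed-form marginals of the two families can be written out as below. The symbols f (drift), g (diffusion coefficient) and w̄ (reverse-time Wiener process) are notation assumed here for the general form in [3]; they do not appear elsewhere in this text.

```latex
% General forward SDE and its reverse-time counterpart
\begin{aligned}
dx &= f(x,t)\,dt + g(t)\,dw, \\
dx &= \left[ f(x,t) - g(t)^2\,\nabla_x \log p_t(x) \right] dt + g(t)\,d\bar{w}.
\end{aligned}

% VP-SDE: f(x,t) = -\tfrac{1}{2}\beta(t)\,x,\quad g(t) = \sqrt{\beta(t)}
q(x_t \mid x_0) = \mathcal{N}\!\left( x_t;\;
    x_0\, e^{-\frac{1}{2}\int_0^t \beta(s)\,ds},\;
    \left(1 - e^{-\int_0^t \beta(s)\,ds}\right) I \right)

% VE-SDE: f(x,t) = 0,\quad g(t) = \sqrt{d\sigma^2(t)/dt}
q(x_t \mid x_0) = \mathcal{N}\!\left( x_t;\; x_0,\;
    \left[\sigma^2(t) - \sigma^2(0)\right] I \right)
```

Identifying ᾱ_t with exp(−∫₀ᵗ β(s) ds) recovers the DDPM marginal q(x_t | x_0) given earlier, which is the sense in which the VP-SDE is DDPM's continuous limit.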

Flow Matching: A Parallel Track

Flow matching (Lipman et al., 2022) bypasses the noise schedule entirely by learning a velocity field that maps noise to data via straight paths. Models including Stable Diffusion 3 and FLUX adopt rectified flow.

Key tradeoff [16]: with 5–10 inference steps, rectified flow matches DDPM quality at 50+ steps. For training, flow matching by default samples t uniformly on [0, 1], which is equivalent to a specific schedule choice (a fixed, non-uniform density over log-SNR) and is empirically robust across resolutions. After path straightening, rectified flow improves FID and CLIP score at equivalent step counts.
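A minimal sketch of the straight-path construction follows. It assumes the convention that t=0 is data and t=1 is noise, so x_t = (1−t)·x_0 + t·ε; other references use the opposite direction. The function names are hypothetical.

```python
import numpy as np

def rectified_flow_pair(x0, eps, t):
    """Point on the straight path between data and noise, plus its velocity."""
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0  # d x_t / dt, constant along the path
    return x_t, v_target

def rectified_flow_log_snr(t):
    # Signal weight (1 - t), noise weight t  =>  SNR = ((1 - t) / t)^2
    t = np.asarray(t, dtype=float)
    return 2.0 * np.log((1.0 - t) / t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))
eps = rng.standard_normal((4, 8))
t = rng.uniform(0.0, 1.0, size=4)
x_t, v_target = rectified_flow_pair(x0, eps, t)
print(np.round(rectified_flow_log_snr([0.1, 0.5, 0.9]), 3))
```

The second helper makes the link to the rest of this section explicit: every t on the straight path corresponds to a log-SNR value, so the choice of how t is sampled is still a choice of noise-level distribution.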

Practical Guidance

Start with cosine for ≤256px tasks; at identical compute, linear generally performs worse.
Enforce zero terminal SNR on any new training run — it is a free correctness fix that unblocks the full brightness range.
Use EDM or Laplace when designing training from scratch; Laplace’s 26% FID gain costs only a different schedule function.
Scale the input, not the schedule when increasing resolution; the b-scaling fix is a one-line change that handles 1024px.
Track log-SNR, not t when debugging: the nominal timestep is an arbitrary proxy; the actual quantity that controls learning difficulty is the signal-to-noise ratio.