A diffusion model is a generative model that learns to create data, such as images, by reversing a gradual corruption process.
Informally, it works in two stages:
Forward diffusion: start with a real data sample and gradually add small amounts of noise until it becomes almost pure Gaussian noise.
Reverse diffusion: train a neural network to undo that noise step by step, so that starting from random noise it can generate a realistic sample.
Theoretical definition
Forward diffusion process
Let $x_0 \sim q(x)$ be a data point from the true data distribution. A diffusion model defines a latent Markov chain $x_0, x_1, \dots, x_T$.
The forward process gradually destroys structure by adding Gaussian noise:
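$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$
where $0 < \beta_t < 1$ is the noise schedule: each step shrinks the signal by $\sqrt{1-\beta_t}$ and adds Gaussian noise with variance $\beta_t$. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The distribution of $x_t$ given $x_0$ then has a closed form:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$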
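Proof by induction. Base case ($t=1$): $q(x_1 \mid x_0) = \mathcal{N}\left(x_1;\ \sqrt{\alpha_1}\, x_0,\ \beta_1 I\right)$. Since $\bar{\alpha}_1 = \prod_{s=1}^{1} \alpha_s = \alpha_1$ and $\beta_1 = 1 - \alpha_1 = 1 - \bar{\alpha}_1$, substituting $\bar{\alpha}_1$ and $(1-\bar{\alpha}_1)$ into the transition equation yields:
$$q(x_1 \mid x_0) = \mathcal{N}\left(x_1;\ \sqrt{\bar{\alpha}_1}\, x_0,\ (1-\bar{\alpha}_1) I\right)$$
so the base case holds.
Inductive step: assume the expression holds at $t-1$, that is:
$$q(x_{t-1} \mid x_0) = \mathcal{N}\left(x_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\, x_0,\ (1-\bar{\alpha}_{t-1}) I\right)$$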
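Using the reparameterization trick, we can write $x_{t-1}$ as:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_{t-1}, \qquad \epsilon_{t-1} \sim \mathcal{N}(0, I)$$
Applying one more forward step with independent noise $\epsilon_t \sim \mathcal{N}(0, I)$ gives:
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{\alpha_t (1-\bar{\alpha}_{t-1})}\, \epsilon_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t$$
The two noise terms are independent zero-mean Gaussians, so their sum is Gaussian with variance $\alpha_t (1-\bar{\alpha}_{t-1}) + (1-\alpha_t) = 1 - \bar{\alpha}_t$. Hence $q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$, which completes the induction.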
So at any time $t$, $x_t$ is just a noisy version of $x_0$. Note that when we choose $\{\beta_t\}$ such that $\bar{\alpha}_T \approx 0$, the forward diffusion process destroys essentially all the information in the original data: $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$.
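The closed form can be checked empirically. The sketch below is illustrative only: it assumes scalar data, a linear schedule from 1e-4 to 0.02 over 1000 steps, and helper names (`alpha_bars`, `forward_stepwise`) that are made up for this example. It applies the one-step transition repeatedly and compares the sample mean and variance with the closed-form values.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def forward_stepwise(x0, betas, t, rng):
    """Apply q(x_s | x_{s-1}) for s = 1..t, one step at a time."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

# Assumed schedule: betas[t - 1] is beta_t, linearly spaced.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abars = alpha_bars(betas)

rng = random.Random(0)
t, x0 = 200, 1.5
samples = [forward_stepwise(x0, betas, t, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean, math.sqrt(abars[t - 1]) * x0)  # empirical vs. closed-form mean
print(var, 1.0 - abars[t - 1])             # empirical vs. closed-form variance
print(abars[-1])                           # near 0, so x_T is close to N(0, 1)
```

In practice the closed form is what makes training cheap: $x_t$ can be drawn in one step for any $t$ instead of simulating the whole chain.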
Reverse diffusion process
If we could reverse the forward process above and sample from $q(x_{t-1} \mid x_t)$, we would be able to recreate a true sample from Gaussian noise input. Unfortunately, $q(x_{t-1} \mid x_t)$ cannot be estimated easily because computing it requires the entire dataset, so we learn a model $p_\theta$ to approximate these conditional probabilities.
The forward noise levels $\beta_t$ are either fixed or learnable. A useful property is that when $\beta_t$ is small (which is usually the case in diffusion models), the reverse conditional $q(x_{t-1} \mid x_t)$ has approximately the same functional form as $q(x_t \mid x_{t-1})$, i.e. Gaussian (see Section 2.2 of [3]). Under this assumption, modeling the reverse denoising step as Gaussian (i.e. choosing $p_\theta(x_{t-1} \mid x_t)$ to be Gaussian) is justified, because the true reverse dynamics of a small Gaussian corruption are themselves approximately Gaussian.
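$p_\theta$ learns a reverse Markov chain with transitions $p_\theta(x_{t-1} \mid x_t)$, typically parameterized as a Gaussian:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Training minimizes a variational bound on the negative log-likelihood, which decomposes as $L = L_0 + \sum_{t=2}^{T} L_{t-1} + L_T$ [2]. The terms are: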
$L_0 = -\log p_\theta(x_0 \mid x_1)$: reconstruction / data likelihood term
Ensures the last reverse step actually lands on the data distribution.
In many DDPM parameterizations, this becomes a simple Gaussian likelihood term.
$L_{t-1} = D_{KL}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)$: reverse-step matching / denoising term
This is the core of diffusion learning: it is the term that actually trains the model's reverse transition. Given $x_t$, the learned denoising step should match the true reverse posterior induced by the forward process.
The forward process posteriors (the true posteriors) are tractable when conditioned on $x_0$, so $L_{t-1}$ is just a KL divergence between Gaussians, which has a closed form.
$L_T = D_{KL}\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)$: prior matching term
Forces the final forward-noised variable $x_T$ to match the simple prior, usually $p(x_T) = \mathcal{N}(0, I)$.
In practice, with a sufficiently long diffusion chain, $q(x_T \mid x_0)$ is already very close to $\mathcal{N}(0, I)$, and the forward process $q$ is often not learnable, so this term is treated as a constant.
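To see why the prior matching term is usually ignored, the following sketch evaluates the KL divergence between $q(x_T \mid x_0)$ and $\mathcal{N}(0, 1)$ in closed form. It assumes scalar data and a linear schedule from 1e-4 to 0.02 with $T = 1000$; the function name `prior_matching_kl` is hypothetical.

```python
import math

def prior_matching_kl(x0, abar_T):
    """KL( N(sqrt(abar_T) * x0, 1 - abar_T) || N(0, 1) ) for a scalar x0."""
    mean_sq = abar_T * x0 ** 2
    var = 1.0 - abar_T
    return 0.5 * (var + mean_sq - 1.0 - math.log(var))

# Assumed linear schedule; abar_T is the product of all (1 - beta_t).
T = 1000
abar_T = 1.0
for i in range(T):
    abar_T *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))

print(abar_T)                          # almost no signal is left at t = T
print(prior_matching_kl(3.0, abar_T))  # correspondingly small KL
```

With a fixed schedule this value has no dependence on $\theta$, which is why it drops out of the optimization.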
From per-step KL to noise-matching loss
From the previous section, our main learning signal is:
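$$L_{t-1} = \mathbb{E}_{q(x_0, x_t)}\left[D_{KL}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)\right]$$
where $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$ and $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$. For $d$-dimensional data, the KL divergence between these two Gaussians is:
$$D_{KL}(q \,\|\, p_\theta) = \frac{1}{2}\left[\frac{\|\tilde{\mu}_t - \mu_\theta\|^2}{\sigma_t^2} + d\left(\frac{\tilde{\beta}_t}{\sigma_t^2} - 1 - \log\frac{\tilde{\beta}_t}{\sigma_t^2}\right)\right] = \begin{cases} \dfrac{1}{2\tilde{\beta}_t}\|\tilde{\mu}_t - \mu_\theta\|^2 & \text{if } \sigma_t^2 = \tilde{\beta}_t \\[2ex] \dfrac{1}{2\sigma_t^2}\|\tilde{\mu}_t - \mu_\theta\|^2 + C_t & \text{if } \sigma_t^2 \neq \tilde{\beta}_t \text{ but } \sigma_t^2 \text{ is fixed} \end{cases}$$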
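In DDPM, $\sigma_t^2$ is set to $\tilde{\beta}_t$ or $\beta_t$, so the KL reduces to training the model mean to match the true posterior mean:
$$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C$$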
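The reduction can be verified numerically. This is a minimal sketch, assuming isotropic Gaussians with means given as plain Python lists; `isotropic_gaussian_kl` and the numbers are illustrative stand-ins, not values from a trained model. It shows that the KL equals the scaled squared distance when the two variances match, and that with a different fixed variance the leftover term does not depend on the means.

```python
import math

def isotropic_gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ) for equal-length lists."""
    d = len(mu_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    ratio = var_q / var_p
    return 0.5 * (sq_dist / var_p + d * (ratio - 1.0 - math.log(ratio)))

mu_tilde = [0.3, -1.2, 0.8]  # stand-in for the posterior mean
mu_theta = [0.1, -1.0, 1.0]  # stand-in for the model mean
beta_tilde = 0.01            # posterior variance
sq_dist = sum((a - b) ** 2 for a, b in zip(mu_tilde, mu_theta))

# Case sigma_t^2 = beta_tilde: only the scaled squared distance survives.
kl_equal = isotropic_gaussian_kl(mu_tilde, beta_tilde, mu_theta, beta_tilde)
print(kl_equal, sq_dist / (2.0 * beta_tilde))

# Case of a different, fixed sigma_t^2: the remainder is a constant C_t.
sigma2 = 0.02
kl_fixed = isotropic_gaussian_kl(mu_tilde, beta_tilde, mu_theta, sigma2)
c_t = kl_fixed - sq_dist / (2.0 * sigma2)
print(c_t)
```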
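From above, $\tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t$. Substituting $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right)$ gives:
$$\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right) \tag{4}$$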
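The substitution is easy to get wrong by a factor, so a quick numerical check helps. The sketch below uses scalar values chosen arbitrarily for illustration and hypothetical function names; it evaluates the posterior mean once from $x_0$ and $x_t$ and once from $x_t$ and $\epsilon$.

```python
import math

def posterior_mean_from_x0(x_t, x0, alpha_t, abar_t, abar_prev):
    """Posterior mean written in terms of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, alpha_t, abar_t):
    """Posterior mean written in terms of x_t and the noise eps."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

# Arbitrary illustrative values; abar_t = alpha_t * abar_{t-1} by definition.
alpha_t, abar_prev = 0.98, 0.60
abar_t = alpha_t * abar_prev
x0, eps = 1.7, -0.4
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

a = posterior_mean_from_x0(x_t, x0, alpha_t, abar_t, abar_prev)
b = posterior_mean_from_eps(x_t, eps, alpha_t, abar_t)
print(a, b)  # the two forms agree up to floating-point error
```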
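We want to predict $\tilde{\mu}_t$ with $\mu_\theta$ given $x_t$ and $t$. Since $x_t$ is available to the model as input, only the noise needs to be predicted, so we parameterize the Gaussian noise term by replacing the $\epsilon$ in Equation 4 with a neural prediction $\epsilon_\theta(x_t, t)$. Then:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$
$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\left\|\epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right] + C \tag{5}$$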
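Equation 5 shows that the noise-prediction loss is weighted by a timestep-dependent constant. DDPM reduces it to a simpler form:
$$L_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right],$$
where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$.
This discards the weighting in Equation 5. Relative to the weighted bound, it implicitly puts less focus on the easy low-noise denoising steps (small $t$) and relatively more focus on the hard high-noise steps (large $t$), which tends to improve sample quality.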
Training and Inference
Training:
Repeat until converged:
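Sample a clean data point: $x_0 \sim q(x_0)$
Sample a timestep uniformly: $t \sim \text{Uniform}(\{1, \dots, T\})$
Sample Gaussian noise: $\epsilon \sim \mathcal{N}(0, I)$
Construct the noisy version of the data at timestep $t$: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$
Take a gradient descent step on the loss $\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2$. Equivalently, the step follows the gradient $\nabla_\theta \left\|\epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2$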
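The training steps above can be sketched end to end with a deliberately tiny model. Everything here is an assumption made for illustration: the data are scalars drawn from $\mathcal{N}(0, 1)$, the schedule has only 10 steps, and the "network" is one learnable scalar per timestep, $\epsilon_\theta(x_t, t) = w_t x_t$, with a hand-written gradient. A real implementation would use a neural network and an autodiff library.

```python
import math
import random

T = 10
betas = [0.05 + 0.03 * i for i in range(T)]  # illustrative schedule
abars, prod = [], 1.0
for beta in betas:
    prod *= 1.0 - beta
    abars.append(prod)

weights = [0.0] * T  # toy parameters: eps_theta(x_t, t) = weights[t - 1] * x_t

def eps_theta(x_t, t):
    return weights[t - 1] * x_t

def train_step(rng, lr=0.01):
    """One stochastic gradient step on (eps - eps_theta(x_t, t))^2."""
    x0 = rng.gauss(0.0, 1.0)   # x_0 ~ q(x_0), assumed to be N(0, 1)
    t = rng.randint(1, T)      # t ~ Uniform({1, ..., T}), both ends inclusive
    eps = rng.gauss(0.0, 1.0)  # eps ~ N(0, 1)
    abar = abars[t - 1]
    x_t = math.sqrt(abar) * x0 + math.sqrt(1.0 - abar) * eps
    residual = eps - eps_theta(x_t, t)
    # d/dw (eps - w * x_t)^2 = -2 * residual * x_t
    weights[t - 1] -= lr * (-2.0 * residual * x_t)
    return residual ** 2

rng = random.Random(0)
losses = [train_step(rng) for _ in range(20000)]
print(sum(losses[:1000]) / 1000, sum(losses[-1000:]) / 1000)
# For this toy setup the best linear predictor at step t is sqrt(1 - abar_t).
print(weights[-1], math.sqrt(1.0 - abars[-1]))
```

For unit-variance Gaussian data the optimal noise predictor happens to be linear in $x_t$, which is why such a small model suffices here; that is a property of the toy setup, not of diffusion training in general.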
Sampling:
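Initialize from pure Gaussian noise: $x_T \sim \mathcal{N}(0, I)$
For $t = T, T-1, \dots, 1$:
Sample $z \sim \mathcal{N}(0, I)$ if $t > 1$, else set $z = 0$
Compute the previous sample: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$
(that is, $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$)
Return $x_0$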
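The sampling loop can be sketched as follows for scalar data. The choices here are assumptions for illustration: $\sigma_t^2 = \beta_t$, a 50-step linear schedule, and `make_unit_gaussian_eps_model`, a hypothetical stand-in for a trained network that returns the exact conditional expectation of $\epsilon$ given $x_t$ when $x_0 \sim \mathcal{N}(0, 1)$.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), stored at index t - 1."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def ddpm_sample(eps_model, betas, rng):
    """Ancestral sampling for scalar data with sigma_t^2 = beta_t."""
    abars = cumulative_alphas(betas)
    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in range(len(betas), 0, -1):
        beta, abar = betas[t - 1], abars[t - 1]
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the last step
        eps_hat = eps_model(x, t)
        mean = (x - beta / math.sqrt(1.0 - abar) * eps_hat) / math.sqrt(1.0 - beta)
        x = mean + math.sqrt(beta) * z  # x_{t-1} = mu_theta + sigma_t * z
    return x

def make_unit_gaussian_eps_model(betas):
    """Stand-in for a trained model: E[eps | x_t] when x_0 ~ N(0, 1)."""
    abars = cumulative_alphas(betas)
    return lambda x_t, t: math.sqrt(1.0 - abars[t - 1]) * x_t

betas = [0.001 + 0.049 * i / 49 for i in range(50)]
model = make_unit_gaussian_eps_model(betas)
rng = random.Random(0)
samples = [ddpm_sample(model, betas, rng) for _ in range(2000)]
second_moment = sum(s * s for s in samples) / len(samples)
print(second_moment)  # should be close to 1, the variance of the assumed data
```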
Comparison with VAE
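A standard VAE has one latent variable $z$:
$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$$
In a standard VAE, we want to match $q_\phi(z \mid x)$ (the tractable encoder) to $p_\theta(z \mid x)$ (the intractable true posterior). We can only do this indirectly, by maximizing the ELBO $\mathcal{L}$: since $\log p_\theta(x)$ does not depend on $\phi$, raising the ELBO with respect to $\phi$ lowers the KL divergence by the same amount.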
A single bottleneck variable $z$ must summarize the whole sample, and the decoder must generate everything from that one representation. Global structure and fine detail are all pushed through one latent layer: if the latent dimension is too small, information is lost; if the decoder is too strong, posterior collapse can happen.
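The decomposition of $\log p_\theta(x)$ into ELBO plus KL can be checked on a toy model where every quantity has a closed form. The model below is an assumption chosen for tractability, not a realistic VAE: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x \mid z) = \mathcal{N}(z, 1)$, and encoder $q(z \mid x) = \mathcal{N}(m, s^2)$ with free parameters `m` and `s2`.

```python
import math

def log_evidence(x):
    """log p(x); with the assumed prior and likelihood, p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s2):
    """ELBO for the encoder q(z | x) = N(m, s2), in closed form."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    expected_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)
    return expected_log_lik + expected_log_prior + entropy

def kl_to_true_posterior(x, m, s2):
    """KL( N(m, s2) || p(z | x) ), where p(z | x) = N(x / 2, 1 / 2)."""
    post_mean, post_var = x / 2.0, 0.5
    ratio = s2 / post_var
    return 0.5 * (ratio + (m - post_mean) ** 2 / post_var - 1.0 - math.log(ratio))

x, m, s2 = 1.3, 0.2, 0.7
print(log_evidence(x), elbo(x, m, s2) + kl_to_true_posterior(x, m, s2))
# When the encoder equals the true posterior, the KL vanishes and the bound is tight.
print(log_evidence(x), elbo(x, x / 2.0, 0.5))
```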
In contrast, the fixed forward process in diffusion models makes all KL terms computable in closed form:
$q$ is fixed, known, and exact, so a diffusion model avoids learning an approximate inference network for the forward chain.
There is not one latent but a whole hierarchy of progressively less noisy variables, so generation is refined gradually.
[2] Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20).
[3] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Proceedings of the 32nd International Conference on Machine Learning.