blog · April 6, 2026

Diffusion model

ML

Introduction

A diffusion model is a generative model that learns to create data, such as images, by reversing a gradual corruption process.

Informally, it works in two stages:

  1. Forward diffusion: start with a real data sample and gradually add small amounts of noise until it becomes almost pure Gaussian noise.
  2. Reverse diffusion: train a neural network to undo that noise step by step, so that starting from random noise it can generate a realistic sample.

Theoretical definition

Forward diffusion process

Let $x_0 \sim q(x)$ be a data point from the true data distribution. A diffusion model defines a latent Markov chain $x_0, x_1, \dots, x_T$.

The forward process gradually destroys structure by adding Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\right)$$

where $0 < \beta_t < 1$ is the noise schedule: each step shrinks the signal and adds Gaussian noise. Let $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. We can then derive a closed form:

$$q(x_t \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t}}\,x_0, (1-\bar{\alpha}_{t})I\right) \tag{1}$$

Proof by induction:
Base case ($t=1$):
By definition, $q(x_1 \mid x_0) = \mathcal{N}(\sqrt{\alpha_1}\,x_0, \beta_1 I)$. Since $\bar{\alpha}_1 = \prod_{s=1}^1 \alpha_s = \alpha_1$ and $\beta_1 = 1 - \alpha_1 = 1 - \bar{\alpha}_1$, substituting $\bar{\alpha}_1$ and $(1 - \bar{\alpha}_1)$ into the transition equation yields:

$$q(x_1 \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_1}\,x_0, (1 - \bar{\alpha}_1)I\right)$$

→ The base case holds.
Inductive step: assume the expression holds at $t-1$, that is:

$$q(x_{t-1} \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,x_0, (1-\bar{\alpha}_{t-1})I\right)$$

Using the reparameterization trick, we can write $x_{t-1}$ as:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_{t-1}, \quad \text{where}\; \epsilon_{t-1} \sim \mathcal{N}(0, I) \tag{2}$$

By the definition of the forward process, $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}\,x_{t-1}, \beta_t I)$, so we can reparameterize $x_t$ in the same way:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_t, \quad \text{where}\; \epsilon_t \sim \mathcal{N}(0, I) \tag{3}$$

Substituting Equation (2) into Equation (3):

$$\begin{align*}
x_t &= \sqrt{\alpha_t} \Big( \sqrt{\bar{\alpha}_{t-1}}\,x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_{t-1} \Big) + \sqrt{\beta_t}\,\epsilon_t \\
&= \sqrt{\alpha_t \bar{\alpha}_{t-1}}\,x_0 + \sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\,\epsilon_{t-1} + \sqrt{\beta_t}\,\epsilon_t \\
&= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{\alpha_t - \bar{\alpha}_t}\,\epsilon_{t-1} + \sqrt{\beta_t}\,\epsilon_t \qquad \text{(using } \bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}\text{)}
\end{align*}$$

Because $\epsilon_{t-1}$ and $\epsilon_t$ are independent standard Gaussians, we can merge them using the identity for sums of independent Gaussians ($a\,\epsilon_{t-1} + b\,\epsilon_t$ has the same distribution as $\sqrt{a^2 + b^2}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$):

$$\begin{align*}
\text{Variance} &= \left(\sqrt{\alpha_t - \bar{\alpha}_t}\right)^2 + \left(\sqrt{\beta_t}\right)^2 \\
&= \alpha_t - \bar{\alpha}_t + 1 - \alpha_t \\
&= 1 - \bar{\alpha}_t
\end{align*}$$

Therefore:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \qquad q(x_t \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,x_0, (1 - \bar{\alpha}_t)I\right)$$

So at any time $t$, $x_t$ is just a noisy version of $x_0$. Note that when we choose $\{\beta_t\}$ such that $\bar\alpha_T \approx 0$, the forward diffusion process destroys almost all the information in the original data: $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$.
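As a rough numerical check of the closed form, the sketch below compares applying the one-step kernel $t$ times with sampling $x_t$ in one shot. The linear schedule ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$, the values reported in [2]), the constant $x_0 = 2$, and the helper name `q_sample` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear betas with T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # alpha_bars[t - 1] holds \bar{alpha}_t


def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) directly from the closed form (hypothetical helper)."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps


# Run the step-by-step chain on many copies of the same x0.
x0 = np.full(50_000, 2.0)
t = 300
x_chain = x0.copy()
for s in range(1, t + 1):
    noise = rng.standard_normal(x_chain.shape)
    x_chain = np.sqrt(alphas[s - 1]) * x_chain + np.sqrt(betas[s - 1]) * noise

# Sample the same timestep in one shot.
x_direct = q_sample(x0, t, rng.standard_normal(x0.shape))

print("chain  mean/var:", x_chain.mean(), x_chain.var())
print("direct mean/var:", x_direct.mean(), x_direct.var())
print("closed form    :", 2.0 * np.sqrt(alpha_bars[t - 1]), 1.0 - alpha_bars[t - 1])
print("alpha_bar_T    :", alpha_bars[-1])
```

Both empirical means and variances should agree with $\sqrt{\bar\alpha_t}\,x_0$ and $1-\bar\alpha_t$ up to Monte Carlo error, and $\bar\alpha_T$ should be close to zero under this schedule.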


Reverse diffusion process

If we can reverse the forward process above and sample from $q(x_{t-1} \mid x_t)$, we can recreate a true sample from a Gaussian noise input. Unfortunately, we cannot easily estimate $q(x_{t-1} \mid x_t)$ because doing so requires the entire dataset, so we learn a model $p_\theta$ to approximate these conditional probabilities.

The forward noise levels $\beta_t$ can be either fixed or learned. A useful property is that when $\beta_t$ is small (which is usually the case in diffusion models), the reverse conditional $q(x_{t-1} \mid x_t)$ has the same functional form as $q(x_{t} \mid x_{t-1})$, i.e. Gaussian (see Section 2.2 of [3]). This justifies choosing $p_\theta(x_{t-1}\mid x_t)$ to be Gaussian: the true reverse dynamics of a small Gaussian corruption are themselves approximately Gaussian.

$p_\theta$ learns a reverse Markov chain $p_\theta(x_{t-1}\mid x_t)$, typically parameterized as a Gaussian:

$$p_\theta(x_{t-1}\mid x_t)=\mathcal{N}\left(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t)\right)$$

The full reverse process (the generative model) is:

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t), \quad \text{where}\; p(x_T)=\mathcal{N}(0,I)$$

Variational lower bound

We want to maximize the evidence over the whole data distribution, i.e. maximize $\mathbb{E}_{q(x_0)} \log p_\theta(x_0)$ ([1]):

$$\begin{align*}
\mathbb{E}_{q(x_0)} \log p_\theta(x_0)
&= \mathbb{E}_{q(x_0)} \left[ \log \left( \int \cdots \int p_\theta(x_0,x_1,\dots,x_T)\,dx_1\,dx_2\cdots dx_T \right) \right] \\
&= \mathbb{E}_{q(x_0)} \left[ \log \int p_\theta(x_{0:T})\,dx_{1:T} \right] \\
&= \mathbb{E}_{q(x_0)} \left[ \log \int q(x_{1:T}\mid x_0)\, \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\,dx_{1:T} \right] \\
&= \mathbb{E}_{q(x_0)} \log \mathbb{E}_{q(x_{1:T}\mid x_0)} \left[ \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)} \right] \\
&\geq \mathbb{E}_{q(x_{0:T})} \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)} \qquad \text{(by Jensen's inequality)} \\
&= -L_{VLB}
\end{align*}$$

Here $L_{VLB}$ denotes the negative of the lower bound, so maximizing the bound is the same as minimizing $L_{VLB}$.
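To make the Jensen step concrete, here is a minimal sketch on a toy one-latent model, not on a diffusion chain. The model ($p(z)=\mathcal N(0,1)$, $p(x\mid z)=\mathcal N(z,1)$, hence $p(x)=\mathcal N(0,2)$ and true posterior $\mathcal N(x/2, 1/2)$) and the names `log_normal_pdf` and `elbo` are assumptions chosen because everything is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)


def log_normal_pdf(v, mean, var):
    """Log density of a 1-D Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)


# Toy model (assumption): p(z) = N(0, 1), p(x | z) = N(z, 1), so p(x) = N(0, 2).
x = 1.0
log_px = log_normal_pdf(x, 0.0, 2.0)


def elbo(q_mean, q_var, n=10_000):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] for a Gaussian q."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    log_q = log_normal_pdf(z, q_mean, q_var)
    return np.mean(log_joint - log_q)


elbo_exact_posterior = elbo(x / 2.0, 0.5)   # q equals the true posterior: bound is tight
elbo_prior_as_q = elbo(0.0, 1.0)            # a poorer q: bound is strictly lower

print("log p(x)            :", log_px)
print("ELBO, q = posterior :", elbo_exact_posterior)
print("ELBO, q = prior     :", elbo_prior_as_q)
```

The gap between $\log p(x)$ and the bound is the KL divergence from $q$ to the true posterior, which is why the bound is tight only in the first case.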

$L_{VLB}$ can be decomposed further ([2]):

$$\begin{align*}
L_{VLB}
&= \mathbb{E}_q \left[ - \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)} \right] \\
&= \mathbb{E}_q \left[ - \log p(\mathbf{x}_T) - \sum_{t \ge 1} \log \frac{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})} \right] \\
&= \mathbb{E}_q \left[ - \log p(\mathbf{x}_T) - \sum_{t>1} \log \frac{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})} - \log \frac{p_\theta(\mathbf{x}_0\mid \mathbf{x}_1)}{q(\mathbf{x}_1\mid \mathbf{x}_0)} \right] \\
&= \mathbb{E}_q \left[ - \log p(\mathbf{x}_T) - \sum_{t>1} \log \left( \frac{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t,\mathbf{x}_0)} \cdot \frac{q(\mathbf{x}_{t-1}\mid \mathbf{x}_0)}{q(\mathbf{x}_t\mid \mathbf{x}_0)} \right) - \log \frac{p_\theta(\mathbf{x}_0\mid \mathbf{x}_1)}{q(\mathbf{x}_1\mid \mathbf{x}_0)} \right] \\
&= \mathbb{E}_q \left[ - \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T\mid \mathbf{x}_0)} - \sum_{t>1} \log \frac{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t,\mathbf{x}_0)} - \log p_\theta(\mathbf{x}_0\mid \mathbf{x}_1) \right] \\
&= \mathbb{E}_q \Bigg[ \underbrace{D_{\mathrm{KL}}\!\left( q(\mathbf{x}_T\mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T) \right)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\!\left( q(\mathbf{x}_{t-1}\mid \mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t) \right)}_{L_{t-1}} \;\underbrace{ - \log p_\theta(\mathbf{x}_0\mid \mathbf{x}_1)}_{L_0} \Bigg]
\end{align*}$$
  • $L_0 = -\log p_\theta(\mathbf{x}_0\mid \mathbf{x}_1)$: reconstruction / data likelihood term
    • Ensures the last reverse step actually lands on the data distribution.
    • In many DDPM parameterizations, this becomes a simple Gaussian likelihood term.
  • $L_{t-1}=D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1}\mid \mathbf{x}_t,\mathbf{x}_0)\,\|\, p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)\right)$: reverse-step matching / denoising term
    • This is the core of diffusion learning: it is the term that actually trains the model's reverse transition. Given $x_t$, the learned denoising step should match the true reverse posterior induced by the forward process.

The forward process posteriors (the true posteriors) are tractable when conditioned on $x_0$, so $L_{t-1}$ is just a KL divergence between two Gaussians, which has a closed form:

$$\begin{align*}
&q(x_{t-1} \mid x_t, x_0) = \mathcal N\left(x_{t-1}; \tilde\mu_t(x_t, x_0), \tilde\beta_t I\right) \\
&\text{where} \; \tilde\mu_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t \;\;\text{and}\;\; \tilde\beta_t:=\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t
\end{align*}$$
  • $L_T = D_{\mathrm{KL}}\!\left(q(\mathbf{x}_T\mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\right)$: prior matching term
    • Forces the final forward-noised variable $x_T$ to match the simple prior, usually $p(x_T)=\mathcal N(0,I)$.
    • In practice, with a sufficiently long diffusion chain, $q(x_T\mid x_0)$ is already very close to $\mathcal N(0,I)$. Since the forward process $q$ usually has no learnable parameters, this term is treated as a constant.
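The posterior parameters $\tilde\mu_t$ and $\tilde\beta_t$ given above can be tabulated for every timestep at once. The sketch below assumes a linear schedule with $T=1000$ and the convention $\bar\alpha_0 = 1$ (neither is fixed by the derivation), and then checks that averaging the posterior over $q(x_t \mid x_0)$ recovers $q(x_{t-1}\mid x_0)$.

```python
import numpy as np

# Assumed schedule: linear betas with T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
# \bar{alpha}_{t-1} for t = 1..T, using the convention \bar{alpha}_0 = 1 (assumption).
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))

# Posterior mean is coef_x0 * x0 + coef_xt * x_t; posterior variance is beta_tilde.
coef_x0 = np.sqrt(alpha_bars_prev) * betas / (1.0 - alpha_bars)
coef_xt = np.sqrt(alphas) * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)
beta_tilde = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas

# Marginalizing x_t ~ N(sqrt(abar_t) x0, (1 - abar_t) I) must give back
# x_{t-1} ~ N(sqrt(abar_{t-1}) x0, (1 - abar_{t-1}) I).
mean_ok = np.allclose(coef_x0 + coef_xt * np.sqrt(alpha_bars), np.sqrt(alpha_bars_prev))
var_ok = np.allclose(beta_tilde + coef_xt**2 * (1.0 - alpha_bars), 1.0 - alpha_bars_prev)

print("mean consistent:", mean_ok)
print("variance consistent:", var_ok)
print("beta_tilde <= beta for all t:", bool(np.all(beta_tilde <= betas)))
```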

From per-step KL to noise-matching loss

From the previous section, our main learning signal is:

$$L_{t-1}=\mathbb{E}_{q(x_0, x_t)}\Big[D_{\mathrm{KL}}\big(q(x_{t-1}\mid x_t,x_0)\;\|\;p_\theta(x_{t-1}\mid x_t)\big)\Big]$$

where $q(x_{t-1}\mid x_t,x_0) = \mathcal N\big(\tilde\mu_t(x_t,x_0), \tilde\beta_t I\big)$ and $p_\theta(x_{t-1}\mid x_t) = \mathcal N\big(\mu_\theta(x_t,t), \sigma_t^2 I\big)$. For $d$-dimensional data, the KL divergence between these two Gaussians is:

$$\begin{align*}
D_{\mathrm{KL}}(q\,\|\,p_\theta)
&=\frac{1}{2}\left[\frac{\|\tilde\mu_t-\mu_\theta\|^2}{\sigma_t^2}+d\left(\frac{\tilde\beta_t}{\sigma_t^2}-1-\log\frac{\tilde\beta_t}{\sigma_t^2}\right)\right] \\
&= \begin{cases}
\frac{1}{2\tilde\beta_t}\|\tilde\mu_t-\mu_\theta\|^2 & \text{if } \sigma_t^2 = \tilde\beta_t \\
\frac{1}{2\sigma_t^2}\|\tilde\mu_t-\mu_\theta\|^2 + C_t & \text{if } \sigma_t^2 \neq \tilde\beta_t \text{ but } \sigma_t^2 \text{ is fixed}
\end{cases}
\end{align*}$$

where $C_t$ does not depend on $\theta$.

In DDPM, $\sigma_t^2$ is set to $\tilde\beta_t$ or $\beta_t$, so the KL reduces to training the model mean to match the true posterior mean:

$$L_{t-1}=\mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\right\|^2\right]+ C$$
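The closed-form KL between two isotropic Gaussians is short enough to write out directly. In the sketch below the function name `kl_isotropic_gaussians` and the numeric values of the means and variances are hypothetical, chosen only to show that the mean-matching term and the $\theta$-independent constant separate.

```python
import numpy as np


def kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ) for d-dimensional means."""
    mu_q = np.asarray(mu_q, dtype=float)
    mu_p = np.asarray(mu_p, dtype=float)
    d = mu_q.size
    ratio = var_q / var_p
    sq_dist = np.sum((mu_q - mu_p) ** 2)
    return 0.5 * (sq_dist / var_p + d * (ratio - 1.0 - np.log(ratio)))


# Illustrative values only (assumptions).
mu_tilde = np.array([0.3, -0.1])    # true posterior mean
mu_theta = np.array([0.1, 0.2])     # model mean
beta_tilde = 0.01                   # true posterior variance
sigma2 = 0.02                       # a fixed model variance different from beta_tilde

# Case sigma_t^2 = beta_tilde: only the scaled squared distance remains.
kl_matched = kl_isotropic_gaussians(mu_tilde, beta_tilde, mu_theta, beta_tilde)

# Case sigma_t^2 != beta_tilde: squared distance plus a constant C_t.
kl_mismatched = kl_isotropic_gaussians(mu_tilde, beta_tilde, mu_theta, sigma2)
c_t = kl_isotropic_gaussians(mu_tilde, beta_tilde, mu_tilde, sigma2)

print("matched variance   :", kl_matched)
print("mismatched variance:", kl_mismatched)
print("constant C_t       :", c_t)
```

Because `c_t` is computed with identical means, it isolates the part of the KL that the model mean cannot influence.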

From above, $\tilde\mu_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t$. Substituting $x_0=\frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right)$ gives:

$$\tilde\mu_t(x_t,x_0)=\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right) \tag{4}$$
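A quick numerical sanity check of this substitution: the sketch below evaluates the posterior mean once from $(x_0, x_t)$ and once from $(x_t, \epsilon)$ and confirms they coincide. The linear schedule, the timestep $t = 500$, and the random test vectors are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear betas with T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                                   # any 2 <= t <= T works here
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps

# Posterior mean written in terms of x0 and x_t.
mu_from_x0 = (np.sqrt(ab_prev) * b_t / (1.0 - ab_t)) * x0 \
    + (np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t

# Posterior mean written in terms of x_t and the noise eps.
mu_from_eps = (x_t - b_t / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)

print("max abs difference:", np.max(np.abs(mu_from_x0 - mu_from_eps)))
```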

We want $\mu_\theta$ to predict $\tilde\mu_t$ given $x_t$ and $t$. Since $x_t$ is available as an input, we only need to parameterize the Gaussian noise term: we replace $\epsilon$ in Equation 4 with a neural prediction $\epsilon_\theta(x_t, t)$, i.e. $\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\right)$. Then:

$$\begin{align*}
L_{t-1}&=\mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \right) (\epsilon - \epsilon_\theta) \right\|^2 \right]+ C \\
&=\mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \frac{\beta_t^2}{\alpha_t(1-\bar\alpha_t)} \left\|\epsilon-\epsilon_\theta(x_t,t)\right\|^2 \right]+ C \tag{5}
\end{align*}$$

Equation 5 shows that the noise-prediction loss is weighted by a timestep-dependent constant. DDPM [2] drops this weight and trains on a simpler objective:

$$L_{\text{simple}}=\mathbb{E}_{x_0,\epsilon,t}\left[\|\epsilon-\epsilon_\theta(x_t,t)\|^2\right]$$

Relative to Equation 5, this down-weights the easy low-noise denoising steps (small $t$) and puts relatively more focus on the hard high-noise steps, which tends to improve sample quality.
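To see what dropping the weight changes, the sketch below tabulates the coefficient $\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar\alpha_t)}$ from Equation 5. It assumes a linear schedule with $T=1000$ and the choice $\sigma_t^2=\beta_t$; other schedules or variance choices would give different numbers.

```python
import numpy as np

# Assumed schedule and variance choice: linear betas, T = 1000, sigma_t^2 = beta_t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
sigma2 = betas

# Per-timestep weight of the noise-matching term in the variational bound.
weights = betas**2 / (2.0 * sigma2 * alphas * (1.0 - alpha_bars))

for t in (1, 10, 100, 500, 1000):
    print(f"t = {t:4d}  weight = {weights[t - 1]:.5f}")
```

Under these assumptions the weight at $t=1$ is far larger than at $t=T$, so replacing every weight by 1 shifts relative emphasis away from the smallest timesteps.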


Training and Inference

Training:

Repeat until converged:

  1. Sample a clean data point: $x_0 \sim q(x_0)$
  2. Sample a timestep uniformly: $t \sim \mathrm{Uniform}(\{1,\dots,T\})$
  3. Sample Gaussian noise: $\epsilon \sim \mathcal N(0, I)$
  4. Construct the noisy version of the data at timestep $t$: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$
  5. Take a gradient descent step on the loss $\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$, i.e. step along $\nabla_\theta \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right) \right\|^2$
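The five training steps above can be sketched in plain NumPy. Everything specific here is an assumption for illustration: the toy 1-D dataset at $\pm 1$, the linear schedule, the learning rate, and above all the stand-in "network" $\epsilon_\theta(x_t, t) = w\,x_t + b$, which ignores $t$ and has a hand-written gradient. A real model would be a neural network trained with an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear betas with T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)


def make_batch(n):
    """Steps 1-4: sample x0, t, eps and build the noisy input x_t."""
    x0 = rng.choice([-1.0, 1.0], size=(n, 1))      # toy dataset (assumption)
    t = rng.integers(1, T + 1, size=n)             # uniform over {1, ..., T}
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1][:, None]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, t, eps


# Hypothetical stand-in for the network: a linear map of x_t that ignores t.
params = {"w": 0.0, "b": 0.0}


def eps_model(x_t, t):
    return params["w"] * x_t + params["b"]


def loss_and_grads(x_t, t, eps):
    """Mean squared noise-prediction error and its gradient w.r.t. w and b."""
    err = eps_model(x_t, t) - eps
    return np.mean(err**2), np.mean(2.0 * err * x_t), np.mean(2.0 * err)


eval_batch = make_batch(4096)
loss_before = loss_and_grads(*eval_batch)[0]

learning_rate = 0.1
for _ in range(500):                               # "repeat until converged"
    x_t, t, eps = make_batch(256)
    _, grad_w, grad_b = loss_and_grads(x_t, t, eps)
    params["w"] -= learning_rate * grad_w          # step 5: gradient descent
    params["b"] -= learning_rate * grad_b

loss_after = loss_and_grads(*eval_batch)[0]
print("loss before:", loss_before, "loss after:", loss_after)
```

Even this crude linear predictor lowers the loss, because $x_t$ and $\epsilon$ are correlated through $\sqrt{1-\bar\alpha_t}$; a time-conditioned network can exploit that dependence much better.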

Sampling

  1. Initialize from pure Gaussian noise: $x_T \sim \mathcal N(0, I)$

  2. For $t = T, T-1, \dots, 1$:

    1. sample $z \sim \mathcal N(0,I)$ if $t>1$, else set $z=0$
    2. compute the previous sample $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t,t)\right)+\sigma_t z$ (that is, $x_{t-1} = \mu_\theta(x_t,t) + \sigma_t z$)

  3. Return $x_0$
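The sampling loop above can be sketched with a pluggable noise predictor. To keep the example checkable without a trained network, it assumes the data distribution is itself $\mathcal N(0, I)$, in which case the optimal predictor is known in closed form: $\mathbb E[\epsilon \mid x_t] = \sqrt{1-\bar\alpha_t}\,x_t$. The names `eps_oracle` and `sample`, the linear schedule, and the choice $\sigma_t^2 = \beta_t$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear betas with T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def eps_oracle(x_t, t):
    """Closed-form E[eps | x_t] when the data is N(0, I) (toy assumption)."""
    return np.sqrt(1.0 - alpha_bars[t - 1]) * x_t


def sample(eps_model, shape):
    """Ancestral sampling from x_T down to x_0 with sigma_t^2 = beta_t."""
    x = rng.standard_normal(shape)                         # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
        mean = (x - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_model(x, t)) / np.sqrt(a_t)
        x = mean + np.sqrt(betas[t - 1]) * z               # x_{t-1} = mu_theta + sigma_t z
    return x


samples = sample(eps_oracle, (5000, 2))
print("sample mean:", samples.mean(), "sample std:", samples.std())
```

With this oracle the generated samples should again look standard normal, which matches the assumed data distribution; swapping `eps_oracle` for a trained $\epsilon_\theta$ leaves the loop unchanged.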


Comparison with VAE

A standard VAE has one latent variable $z$:

$$\log p_\theta(x) = \mathcal L(\theta,\phi;x) + D_{KL}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big)$$
  • In a standard VAE, we want to match $q_\phi(z \mid x)$ (the tractable encoder) to $p_\theta(z \mid x)$ (the intractable true posterior). We can only do this indirectly, by maximizing the ELBO $\mathcal L$, which in turn minimizes the KL divergence.
  • One bottleneck variable $z$ must summarize the whole sample, and the decoder must generate everything from that single representation. Global structure and fine detail are all pushed through one latent layer: if the latent dimension is too small, information is lost; if the decoder is too strong, posterior collapse can happen.

In contrast, the fixed forward process in diffusion models makes all KL terms computable in closed form:

  • $q$ is fixed, known, and exact, so a diffusion model avoids learning an approximate inference network for the forward chain.
  • There is not one latent but a whole hierarchy of progressively less noisy variables, so generation is refined gradually.

References

[1] Weng, Lilian. (Jul 2021). What are diffusion models? Lil’Log. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.

[2] Ho, Jonathan, Jain, Ajay, and Abbeel, Pieter. (2020). Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20).

[3] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning.