blog · April 6, 2026

Diffusion model

ML


Introduction

A diffusion model is a generative model that learns to create data, such as images, by reversing a gradual corruption process.

Informally, it works in two stages:

  1. Forward diffusion: start with a real data sample and gradually add small amounts of noise until it becomes almost pure Gaussian noise.
  2. Reverse diffusion: train a neural network to undo that noise step by step, so that starting from random noise it can generate a realistic sample.

Theoretical definition

forward diffusion process

Let $x_0 \sim q(x)$ be a data point from the true data distribution. A diffusion model defines a latent Markov chain $x_0, x_1, \dots, x_T$.

The forward process gradually destroys structure by adding Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\right)$$

where $0 < \beta_t < 1$ is the noise schedule: every step shrinks the signal and adds Gaussian noise. Let $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. We can then derive a closed form:

$$q(x_t \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)I\right) \tag{1}$$

Proof by induction.
Base case ($t=1$):
$q(x_1 \mid x_0) = \mathcal{N}(\sqrt{\alpha_1}\,x_0, \beta_1 I)$. Since $\bar{\alpha}_1 = \prod_{s=1}^1 \alpha_s = \alpha_1$ and $\beta_1 = 1 - \alpha_1 = 1 - \bar{\alpha}_1$, substituting $\bar{\alpha}_1$ and $(1 - \bar{\alpha}_1)$ into the transition equation yields:

$$q(x_1 \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_1}\,x_0, (1 - \bar{\alpha}_1)I\right)$$

→ Base case holds.
Inductive step: assume the expression holds at $t-1$, that is:

$$q(x_{t-1} \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,x_0, (1-\bar{\alpha}_{t-1})I\right)$$

Using the reparameterization trick, we can write xt1x_{t-1} as:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_{t-1}, \quad \text{where}\; \epsilon_{t-1} \sim \mathcal{N}(0, I) \tag{2}$$

By the definition of the forward process, $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}\,x_{t-1}, \beta_t I)$, so we can reparameterize $x_t$ in the same way:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_t, \quad \text{where}\; \epsilon_t \sim \mathcal{N}(0, I) \tag{3}$$

Substituting equation (2) into (3):

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t} \Big( \sqrt{\bar{\alpha}_{t-1}}\,x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_{t-1} \Big) + \sqrt{\beta_t}\,\epsilon_t \\
&= \sqrt{\alpha_t \bar{\alpha}_{t-1}}\,x_0 + \sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\,\epsilon_{t-1} + \sqrt{\beta_t}\,\epsilon_t \\
&\quad \text{using } \bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1} \\
&= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{\alpha_t - \bar{\alpha}_t}\,\epsilon_{t-1} + \sqrt{\beta_t}\,\epsilon_t
\end{aligned}
$$

Because $\epsilon_{t-1}$ and $\epsilon_t$ are independent standard Gaussians, we can merge them using the sum-of-independent-Gaussians identity ($aX + bY = \sqrt{a^2 + b^2}\,\epsilon$ in distribution):

$$
\begin{aligned}
\text{Variance} &= (\sqrt{\alpha_t - \bar{\alpha}_t})^2 + (\sqrt{\beta_t})^2 \\
&= \alpha_t - \bar{\alpha}_t + 1 - \alpha_t \\
&= 1 - \bar{\alpha}_t
\end{aligned}
$$

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \qquad q(x_t \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,x_0, (1 - \bar{\alpha}_t)I\right)$$

So at any time $t$, $x_t$ is just a noisy version of $x_0$. Note that when we choose $\{\beta_t\}$ such that $\bar\alpha_T \approx 0$, the forward diffusion process destroys essentially all the information in the original data: $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$.
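The closed form can be checked numerically. The sketch below assumes a linear $\beta_t$ schedule from $10^{-4}$ to $0.02$ with $T = 1000$ (a common choice, not something fixed by the derivation above); the names `make_schedule`, `q_sample` and `stepwise_variance` are illustrative. It samples $x_t$ in one shot via Equation 1 and compares the variance obtained by iterating the one-step transition with $1-\bar\alpha_t$.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and its cumulative products."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bars[t - 1] holds alpha_bar_t
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly from the closed form; t is 1-based."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

def stepwise_variance(betas):
    """Variance of x_t given x_0, obtained by iterating the one-step transition."""
    var, out = 0.0, []
    for beta in betas:
        var = (1.0 - beta) * var + beta  # alpha_t * previous variance + beta_t
        out.append(var)
    return np.array(out)

betas, alphas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)
x0 = np.ones(4)
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# The iterated variance should agree with 1 - alpha_bar_t at every step.
max_gap = np.max(np.abs(stepwise_variance(betas) - (1.0 - alpha_bars)))
```

With this schedule `alpha_bars[-1]` is close to zero, which is the regime where $x_T$ is nearly pure noise.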


reverse diffusion process

If we can reverse the above forward process and sample from $q(x_{t-1} \mid x_t)$, we will be able to recreate a true sample from a Gaussian noise input. Unfortunately, we cannot easily estimate $q(x_{t-1} \mid x_t)$ because doing so requires the entire dataset, so we learn a model $p_\theta$ to approximate these conditional probabilities.

The forward noise levels $\beta_t$ are either fixed or learnable. A nice property is that when $\beta_t$ is small (which is usually the case in diffusion models), the reverse conditional $q(x_{t-1} \mid x_t)$ has the same functional form as $q(x_t \mid x_{t-1})$, i.e. Gaussian (see Section 2.2 of [3]). Under this assumption, modeling the reverse denoising step as Gaussian (i.e. choosing a Gaussian functional form for $p_\theta(x_{t-1} \mid x_t)$) is justified, because the true reverse dynamics of a small Gaussian corruption are themselves approximately Gaussian.

$p_\theta$ learns a reverse Markov chain $p_\theta(x_{t-1} \mid x_t)$, typically parameterized as a Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t)\right)$$

The full reverse process (generative model):

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t), \quad \text{where}\; p(x_T)=\mathcal{N}(0,I)$$

Variational lower bound

(See [1].)

We want to maximize the evidence across the whole data distribution, i.e. maximize $\mathbb{E}_{q(x_0)} \log p_\theta(x_0)$:

$$
\begin{aligned}
\mathbb{E}_{q(x_0)} \log p_\theta(x_0)
&= \mathbb{E}_{q(x_0)} \left[ \log \int \cdots \int p_\theta(x_0,x_1,\dots,x_T)\,dx_1\,dx_2\cdots dx_T \right] \\
&= \mathbb{E}_{q(x_0)} \left[ \log \int p_\theta(x_{0:T})\,dx_{1:T} \right] \\
&= \mathbb{E}_{q(x_0)} \left[ \log \int q(x_{1:T}\mid x_0)\, \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\,dx_{1:T} \right] \\
&= \mathbb{E}_{q(x_0)} \log \mathbb{E}_{q(x_{1:T}\mid x_0)} \left[ \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)} \right] \\
&\quad \text{by Jensen's inequality} \\
&\geq \mathbb{E}_{q(x_{0:T})} \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)} = -L_{VLB}
\end{aligned}
$$

Here $L_{VLB}$ denotes the negative of the lower bound, so maximizing the bound is the same as minimizing $L_{VLB}$.

The variational bound can be decomposed further, following [2]:

$$
\begin{aligned}
L_{VLB} &= \mathbb{E}_q \left[ - \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)} \right] \\
&= \mathbb{E}_q \left[ - \log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1}\mid x_t)}{q(x_t\mid x_{t-1})} \right] \\
&= \mathbb{E}_q \left[ - \log p(x_T) - \sum_{t>1} \log \frac{p_\theta(x_{t-1}\mid x_t)}{q(x_t\mid x_{t-1})} - \log \frac{p_\theta(x_0\mid x_1)}{q(x_1\mid x_0)} \right] \\
&= \mathbb{E}_q \left[ - \log p(x_T) - \sum_{t>1} \log \left( \frac{p_\theta(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t,x_0)} \cdot \frac{q(x_{t-1}\mid x_0)}{q(x_t\mid x_0)} \right) - \log \frac{p_\theta(x_0\mid x_1)}{q(x_1\mid x_0)} \right] \\
&= \mathbb{E}_q \left[ - \log \frac{p(x_T)}{q(x_T\mid x_0)} - \sum_{t>1} \log \frac{p_\theta(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t,x_0)} - \log p_\theta(x_0\mid x_1) \right] \\
&= \mathbb{E}_q \Big[ \underbrace{D_{\mathrm{KL}}\!\left( q(x_T\mid x_0)\,\|\,p(x_T) \right)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\!\left( q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t) \right)}_{L_{t-1}} \underbrace{{} - \log p_\theta(x_0\mid x_1)}_{L_0} \Big]
\end{aligned}
$$
  • $L_0 = -\log p_\theta(x_0 \mid x_1)$: reconstruction / data likelihood term
    • Ensures the last reverse step actually lands on the data distribution.
    • In many DDPM parameterizations, this becomes a simple Gaussian likelihood term.
  • $L_{t-1} = D_{\mathrm{KL}}\!\left(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\right)$: reverse-step matching / denoising term
    • This is the core of diffusion learning: it is the term that actually trains the model's reverse transition. Given $x_t$, the learned denoising step should match the true reverse posterior induced by the forward process.

The forward process posteriors (the true posteriors) are tractable when conditioned on $x_0$, so $L_{t-1}$ is just a KL divergence between Gaussians, which has a closed form:

$$
\begin{aligned}
&q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1}; \tilde\mu_t(x_t, x_0), \tilde\beta_t I\right) \\
&\text{where}\; \tilde\mu_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t, \;\text{and}\; \tilde\beta_t := \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t
\end{aligned}
$$

  • $L_T = D_{\mathrm{KL}}\!\left(q(x_T\mid x_0)\,\|\,p(x_T)\right)$: prior matching term
    • Forces the final forward-noised variable $x_T$ to match the simple prior, usually $p(x_T)=\mathcal{N}(0,I)$.
    • In practice, with a sufficiently long diffusion chain, $q(x_T\mid x_0)$ is already very close to $\mathcal{N}(0,I)$. Since the forward process $q$ is usually not learnable, this term has no trainable parameters and is treated as a constant.
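The posterior $q(x_{t-1}\mid x_t, x_0)$ given above is cheap to evaluate. The following is a minimal sketch under the assumption of a linear $\beta_t$ schedule and the convention $\bar\alpha_0 := 1$; the function name `posterior_params` is illustrative.

```python
import numpy as np

def posterior_params(x0, x_t, t, betas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-based, alpha_bar_0 := 1."""
    beta_t = betas[t - 1]
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
    # Coefficients of x_0 and x_t in the posterior mean.
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(1.0 - beta_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    # Posterior variance (tilde beta_t), shared by all coordinates.
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
x0 = np.array([1.0, -2.0])
x_t = np.array([0.3, 0.1])

mean_1, var_1 = posterior_params(x0, x_t, 1, betas, alpha_bars)
mean_500, var_500 = posterior_params(x0, x_t, 500, betas, alpha_bars)
```

At $t = 1$ the posterior collapses onto $x_0$ with zero variance, and for $t > 1$ the posterior variance $\tilde\beta_t$ is strictly smaller than $\beta_t$.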

From per-step KL to noise-matching loss

From the previous section, our main learning signal is:

$$
\begin{aligned}
&L_{t-1}=\mathbb{E}_{q(x_0, x_t)}\Big[D_{\mathrm{KL}}\big(q(x_{t-1}\mid x_t,x_0)\;\|\;p_\theta(x_{t-1}\mid x_t)\big)\Big] \\
&\text{where}\; q(x_{t-1}\mid x_t,x_0) = \mathcal{N}\big(\tilde\mu_t(x_t,x_0), \tilde\beta_t I\big) \\
&\text{and}\; p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\big(\mu_\theta(x_t,t), \sigma_t^2 I\big)
\end{aligned}
$$

$$
\begin{aligned}
D_{\mathrm{KL}}(q\,\|\,p_\theta) &= \frac{1}{2}\left[\frac{\|\tilde\mu_t-\mu_\theta\|^2}{\sigma_t^2}+d\left(\frac{\tilde\beta_t}{\sigma_t^2}-1-\log\frac{\tilde\beta_t}{\sigma_t^2}\right)\right] \\
&= \begin{cases}
\frac{1}{2\tilde\beta_t}\|\tilde\mu_t-\mu_\theta\|^2 & \text{if } \sigma_t^2 = \tilde\beta_t \\
\frac{1}{2\sigma_t^2}\|\tilde\mu_t-\mu_\theta\|^2 + C_t & \text{if } \sigma_t^2 \neq \tilde\beta_t \text{ but } \sigma_t^2 \text{ is fixed}
\end{cases}
\end{aligned}
$$

where $d$ is the data dimension and $C_t$ does not depend on $\theta$. In DDPM, $\sigma_t^2$ is set to $\tilde\beta_t$ or $\beta_t$, so the KL reduces to training the model mean to match the true posterior mean:

$$L_{t-1}=\mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\right\|^2\right]+ C$$
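To make the two cases concrete, here is a small sketch of the closed-form KL between isotropic Gaussians. The function name `kl_isotropic_gaussians` and the example numbers are illustrative assumptions, not values from DDPM.

```python
import numpy as np

def kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ) in d dimensions."""
    d = mu_q.shape[-1]
    sq_dist = np.sum((mu_q - mu_p) ** 2, axis=-1)
    ratio = var_q / var_p
    return 0.5 * (sq_dist / var_p + d * (ratio - 1.0 - np.log(ratio)))

mu_q = np.array([0.5, -1.0, 2.0])
mu_p = np.zeros(3)

# Case 1: equal variances, so only the scaled squared distance between means remains.
kl_same_var = kl_isotropic_gaussians(mu_q, 0.3, mu_p, 0.3)

# Case 2: fixed but different variances, which adds a constant independent of the means.
kl_diff_var = kl_isotropic_gaussians(mu_q, 0.3, mu_p, 0.6)
const_term = kl_diff_var - np.sum((mu_q - mu_p) ** 2) / (2.0 * 0.6)
```

In the second case `const_term` plays the role of $C_t$: it depends only on the two variances and the dimension, so it does not affect the gradient with respect to the model mean.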

From above, $\tilde\mu_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t$. Substituting $x_0=\frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right)$ gives:

$$\tilde\mu_t(x_t,x_0)=\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right) \tag{4}$$
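For readers who want the intermediate algebra behind Equation 4, one way to carry out the substitution is sketched here. It uses only $\bar\alpha_t = \alpha_t\bar\alpha_{t-1}$ (so $\sqrt{\bar\alpha_{t-1}}/\sqrt{\bar\alpha_t} = 1/\sqrt{\alpha_t}$) and $\beta_t = 1-\alpha_t$:

$$
\begin{aligned}
\tilde\mu_t &= \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\cdot\frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right) + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t \\
&= \frac{\beta_t}{\sqrt{\alpha_t}(1-\bar\alpha_t)}x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\,\epsilon + \frac{\alpha_t(1-\bar\alpha_{t-1})}{\sqrt{\alpha_t}(1-\bar\alpha_t)}x_t \\
&= \frac{\beta_t + \alpha_t - \bar\alpha_t}{\sqrt{\alpha_t}(1-\bar\alpha_t)}x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\,\epsilon \\
&= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right)
\end{aligned}
$$

The last line follows because $\beta_t + \alpha_t - \bar\alpha_t = 1 - \bar\alpha_t$.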

We want to predict $\tilde\mu_t$ with $\mu_\theta$ given $x_t$ and $t$. To do so, we parameterize the Gaussian noise term by replacing the $\epsilon$ in Equation 4 with a neural prediction $\epsilon_\theta(x_t, t)$. Then:

$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \right) (\epsilon - \epsilon_\theta) \right\|^2 \right]+ C \\
&= \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \frac{\beta_t^2}{\alpha_t(1-\bar\alpha_t)} \left\|\epsilon-\epsilon_\theta(x_t,t)\right\|^2 \right]+ C
\end{aligned} \tag{5}
$$

Equation 5 shows that the noise-prediction loss is weighted by a timestep-dependent constant. In DDPM, it is reduced to a simpler form:

$$L_{\text{simple}}=\mathbb{E}_{x_0,\epsilon,t}\left[\|\epsilon-\epsilon_\theta(x_t,t)\|^2\right]$$

This discards the weighting in Equation 5. Relative to the weighted objective, it implicitly puts less focus on easy low-noise denoising steps and relatively more focus on hard high-noise steps, which tends to improve sample quality.
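To see what is being discarded, the weight in Equation 5 can be tabulated. This sketch assumes a linear $\beta_t$ schedule with $T = 1000$ and the choice $\sigma_t^2 = \beta_t$; other schedules or variance choices would give different numbers.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule, T = 1000
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
sigma2 = betas                           # assumed choice sigma_t^2 = beta_t

# Per-timestep factor multiplying ||eps - eps_theta||^2 in Equation 5.
weights = betas**2 / (2.0 * sigma2 * alphas * (1.0 - alpha_bars))

for t in (1, 10, 100, 1000):
    print(f"t={t:4d}  weight={weights[t - 1]:.4f}")
```

Under these assumptions the weight at $t = 1$ is much larger than at $t = T$, so setting every weight to 1 in $L_{\text{simple}}$ shifts emphasis away from the small-$t$, low-noise steps.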


Training and Inference

Training:

Repeat until converged:

  1. Sample a clean data point: $x_0 \sim q(x_0)$
  2. Sample a timestep uniformly: $t \sim \mathrm{Uniform}(\{1,\dots,T\})$
  3. Sample Gaussian noise: $\epsilon \sim \mathcal{N}(0, I)$
  4. Construct the noisy version of the data at timestep $t$: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$
  5. Take a gradient descent step on the loss $\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$. Equivalently, the optimization step uses the gradient $\nabla_\theta \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right) \right\|^2$
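The training steps above can be sketched in NumPy as a loss computation for one batch. This is a sketch under assumptions: `ddpm_loss` and `zero_model` are hypothetical names, `zero_model` merely stands in for a neural network $\epsilon_\theta$, the schedule is an assumed linear one, and the gradient step itself is left to whatever autodiff framework is used.

```python
import numpy as np

def ddpm_loss(x0_batch, eps_model, alpha_bars, rng):
    """Monte Carlo estimate of L_simple for one batch of shape (n, d)."""
    n = x0_batch.shape[0]
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=n)             # step 2: t ~ Uniform({1, ..., T})
    eps = rng.standard_normal(x0_batch.shape)      # step 3: eps ~ N(0, I)
    a_bar = alpha_bars[t - 1][:, None]             # one alpha_bar_t per sample
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps   # step 4
    pred = eps_model(x_t, t)
    loss = np.mean(np.sum((eps - pred) ** 2, axis=1))  # step 5: loss to differentiate
    return loss, (x_t, t, eps)

def zero_model(x_t, t):
    """Hypothetical stand-in for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0_batch = rng.standard_normal((64, 8))   # toy stand-in for step 1 (a data batch)
loss, (x_t, t, eps) = ddpm_loss(x0_batch, zero_model, alpha_bars, rng)
```

Each sample in the batch gets its own timestep, which is how a single batch covers many noise levels at once.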

Sampling

  1. Initialize from pure Gaussian noise: $x_T \sim \mathcal{N}(0, I)$

  2. For $t = T, T-1, \dots, 1$:

    1. Sample $z \sim \mathcal{N}(0,I)$ if $t>1$, else $z=0$
    2. Compute the previous sample $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t,t)\right)+\sigma_t z$

    (that is, $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$)

  3. Return $x_0$
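The sampling loop can be sketched as follows, assuming $\sigma_t^2 = \beta_t$ and a linear schedule. To keep the example checkable without a trained network, it uses a toy dataset in which every sample equals a fixed point `c`; for that case the noise can be computed exactly, and `oracle_eps_model` (a hypothetical name) plays the role of $\epsilon_\theta$.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling with sigma_t^2 = beta_t (an assumed choice)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)                     # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        # No noise is added on the final step (t = 1).
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        eps_hat = eps_model(x, t)
        coef = (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t - 1])   # mu_theta(x_t, t)
        x = mean + np.sqrt(betas[t - 1]) * z
    return x

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
c = np.array([2.0, -1.0])               # toy dataset: every sample equals c

def oracle_eps_model(x_t, t):
    """Exact noise for point-mass data at c; stands in for a trained network."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * c) / np.sqrt(1.0 - a_bar)

samples = ddpm_sample(oracle_eps_model, (5, 2), betas, np.random.default_rng(0))
```

With a perfect noise predictor for this toy dataset, every generated row lands on `c` up to floating-point error, which is a convenient sanity check for the update rule.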


Comparison with VAE

A standard VAE has one latent variable $z$:

$$\log p_\theta(x) = \mathcal{L}(\theta,\phi;x) + D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big)$$

  • In a standard VAE, we want to match $q_\phi(z \mid x)$ (the tractable encoder) to $p_\theta(z \mid x)$ (the intractable true posterior). We can only do this indirectly by maximizing the ELBO $\mathcal{L}$ (thus minimizing the KL divergence).
  • One bottleneck variable $z$ must summarize the whole sample. The decoder must generate everything from that single representation. Global structure and fine detail are all pushed through one latent layer: if the latent dimension is too small, information is lost; if the decoder is too strong, posterior collapse can happen.

In contrast, the fixed forward process in diffusion models makes all KL terms computable in closed form.

  • $q$ is fixed, known and exact, so a diffusion model avoids learning an approximate inference network for the forward chain.
  • There is not one latent but a whole hierarchy of progressively less noisy variables, so generation is refined gradually.

References

[1] Weng, Lilian. (Jul 2021). What are diffusion models? Lil’Log. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.

[2] Ho, Jonathan, Jain, Ajay & Abbeel, Pieter. (2020). Denoising diffusion probabilistic models. Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20).

[3] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Proceedings of the 32nd International Conference on Machine Learning.