note · February 8, 2026

ML notes

ML

PCA

  • We want to find the unit vector $w$ that maximizes the variance of the data after projecting onto it (dimension reduction)

    $$
    \begin{aligned}
    w &= \arg\max_{w} \frac{1}{n}\sum_{i=1}^n \left(w^T x^i - w^T\mu\right)^2 \quad \text{where } \|w\| = 1 \\
      &= \arg\max_{w} \; w^T \left(\frac{1}{n} \sum_{i=1}^n (x^i-\mu)(x^i-\mu)^T\right) w = \arg\max_{w} \; w^T C w \\
    L(w, \lambda) &= w^T C w + \lambda \left(1-\|w\|^2\right) \quad \text{(using the Lagrangian function)} \\
    \text{Stationary point:} \quad \frac{\partial L}{\partial w} &= 2Cw - 2\lambda w = 0 \;\Rightarrow\; Cw = \lambda w
    \end{aligned}
    $$

    So $w$ is an eigenvector of $C$. At such a point $w^T C w = \lambda$, so the maximizer is the eigenvector with the largest eigenvalue.
  • Inference:

    • step 0: standardize the data (skip standardization only when you’re confident scales are comparable and variance magnitude is meaningful)
    • step 1: estimate the mean $\mu$ and compute the covariance $C$ from the data
    • step 2: compute the eigenvalues and eigenvectors of $C$ (solve $\det(C-\lambda I) = 0$, which is a degree-$d$ polynomial in $\lambda$ with roots $\lambda_1, \lambda_2, \dots, \lambda_d$, where $d$ is the data dimension)
    • step 3: project onto the $k$ eigenvectors $\nu^1, \dots, \nu^k$ with the largest eigenvalues; dividing each coordinate by $\sqrt{\lambda_j}$ additionally rescales it to unit variance (whitening)
    $$
    z^i = \begin{pmatrix} (\nu^{1})^{T} (x^i - \mu) / \sqrt{\lambda_1} \\ (\nu^{2})^{T} (x^i - \mu) / \sqrt{\lambda_2} \\ \vdots \\ (\nu^{k})^{T} (x^i - \mu) / \sqrt{\lambda_k} \end{pmatrix}
    $$
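
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `pca_whiten`, the synthetic data, and the choice to skip step 0 (standardization) are all assumptions made for the example.

```python
import numpy as np

def pca_whiten(X, k):
    """Project the rows of X onto the top-k principal directions and whiten.

    X has shape (n, d): n samples, d features. Step 0 (standardization)
    is intentionally skipped here; that is an assumption of this sketch.
    """
    mu = X.mean(axis=0)                    # step 1: mean
    Xc = X - mu                            # centered data
    C = Xc.T @ Xc / X.shape[0]             # step 1: covariance with the 1/n convention
    eigvals, eigvecs = np.linalg.eigh(C)   # step 2: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest eigenvalues
    lam = eigvals[order]
    V = eigvecs[:, order]                  # columns are nu^1, ..., nu^k
    Z = (Xc @ V) / np.sqrt(lam)            # step 3: project, then divide by sqrt(lambda_j)
    return Z, lam, V, mu

# Synthetic data (hypothetical): correlated 3-D points with one low-variance direction.
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(500, 3)) @ mixing

Z, lam, V, mu = pca_whiten(X, k=2)
print("top eigenvalues:", lam)
print("covariance of projected data:\n", Z.T @ Z / len(Z))
```

Because of the division by $\sqrt{\lambda_j}$, the projected coordinates should have (approximately) identity covariance; dropping that division gives the plain PCA projection.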

EM algorithm


VAE

Given a dataset $X$, we want to model the underlying distribution $p(x)$. A latent variable model assumes each observation $x$ is generated from a lower-dimensional latent code $z$: first sample $z \sim p(z)$, then generate $x \sim p_\theta(x|z)$. Here $p(z)$ is a fixed prior (typically $\mathcal{N}(0, I)$) and $p_\theta(x|z)$ is a learned conditional distribution parameterized by $\theta$. For our training objective, we wish to maximize the marginal log-likelihood (evidence):

$$
\log p_\theta(x) = \log \int p_\theta(x, z)\,dz = \log \int p_\theta(x|z)p(z)\,dz.
$$

The true posterior $p_\theta(z|x)$ and the evidence $p_\theta(x)$ are both intractable.

By Bayes’ theorem:

$$
p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{p_\theta(x)} = \frac{p_\theta(x|z)p(z)}{\int p_\theta(x|z)p(z)\,dz}
$$

The numerator is entirely tractable: for any given $z$ it can be evaluated directly. The intractability comes entirely from the denominator (the marginal likelihood $p_\theta(x)$), because

  1. There is no closed-form solution. In a VAE, $p_\theta(x|z)$ is parameterized by DNNs. This means mapping $z$ to $x$ involves passing variables through layers of matmuls and nonlinear activations. Because of these nonlinearities, the function $p_\theta(x|z)p(z)$ becomes so complex that we cannot write down a closed-form expression for the integral.
  2. Numerical integration is intractable. Since the integral has no closed form, one fallback is numerical integration: estimating the area under the curve with, for example, a Riemann sum. However, latent spaces are high-dimensional. If we estimate the integral by evaluating just $k$ points along each of the $d$ latent dimensions, we need to evaluate the neural network $k^{d}$ times for a single data sample.

Because the denominator cannot feasibly be calculated analytically or by grid-based numerical integration, the true posterior $p_\theta(z|x)$ stays out of reach, forcing us to use variational inference to approximate it.
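
To make the $k^d$ cost concrete, here is a small sketch in plain Python. It uses a hypothetical 1-D toy model (not a VAE) in which the evidence has a closed form, $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 1)$, so $p(x) = \mathcal{N}(0, 2)$. In one dimension a Riemann sum works well; the second part only counts how many decoder evaluations the same grid would need as $d$ grows. The function names and the grid bounds are assumptions of the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def evidence_riemann(x, k=2001, lo=-10.0, hi=10.0):
    """Midpoint Riemann sum for p(x) = integral of p(x|z) p(z) dz with a 1-D latent."""
    dz = (hi - lo) / k
    total = 0.0
    for j in range(k):
        z = lo + (j + 0.5) * dz
        total += normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) * dz
    return total

x = 0.7
approx = evidence_riemann(x)
exact = normal_pdf(x, 0.0, 2.0)  # closed form, available only because the toy model is linear-Gaussian
print("Riemann sum:", approx, "closed form:", exact)

def grid_evaluations(k, d):
    """Number of integrand (decoder) evaluations for k grid points per latent dimension."""
    return k ** d

for d in (1, 2, 10, 32):
    print(f"d={d:2d}: {grid_evaluations(50, d):.3e} evaluations per data sample")
```

With a neural decoder there is no closed form to compare against, and the evaluation count in the last loop is what rules out the grid approach for realistic latent sizes.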

ELBO (evidence lower-bound) of marginal log-likelihood:

notations:

  • $\theta$: generative parameters (decoder). This model represents the assumption about how the world works: it takes a latent variable $z$ and translates it into observable data / evidence $x$. Therefore, $\theta$ defines the likelihood $p_\theta(x|z)$, which together with the fixed prior $p(z)$ gives the joint $p_\theta(x,z) = p_\theta(x|z)p(z)$.
  • $\phi$: variational parameters (encoder). Because the true posterior $p_\theta(z|x)$ is intractable (explained above), we use an approximate posterior in its place. $\phi$ defines $q_\phi(z|x)$.
$$
\begin{aligned}
\log p_\theta(x) &= \log \int p_\theta(x, z)\,dz \\
&= \log \int q_\phi(z\mid x) \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\,dz \\
&= \log \mathbb E_{q_\phi(z\mid x)} \left[ \frac{p_\theta(x,z)}{q_\phi(z\mid x)} \right] \\
&\ge \mathbb E_{q_\phi(z\mid x)} \left[ \log \frac{p_\theta(x,z)}{q_\phi(z\mid x)} \right] \qquad \text{(Jensen's inequality: } \log \mathbb E_q[Y] \ge \mathbb E_q[\log Y]\text{)} \\
&= \boxed{ \mathcal L(\theta,\phi;x)}
\end{aligned}
$$

Since $\log$ is strictly concave, equality in Jensen's inequality holds iff $Y$ is constant almost surely under $q_\phi(z\mid x)$. That is:

$$
\begin{aligned}
\frac{p_\theta(x,z)}{q_\phi(z\mid x)} &= c \qquad\text{for } q_\phi(z\mid x)\text{-almost every } z \\
p_\theta(x,z) &= c \, q_\phi(z\mid x) \\
p_\theta(x) &= \int p_\theta(x,z)\,dz = c\int q_\phi(z\mid x)\,dz = c \\
p_\theta(x,z) &= p_\theta(x)\,q_\phi(z\mid x) \\
&\boxed{ q_\phi(z\mid x)=p_\theta(z\mid x)}
\end{aligned}
$$

So the bound is tight exactly when the approximate posterior $q_\phi(z\mid x)$ equals the true posterior $p_\theta(z\mid x)$.

Exact decomposition of Evidence to ELBO + KL

$$
\begin{aligned}
\log p_\theta (x) &= \int q_\phi (z\mid x) \log p_\theta(x)\,dz \\
&= \int q_\phi(z\mid x)\log\left(\frac{p_\theta(x,z)}{p_\theta(z\mid x)}\right)dz \\
&= \int q_\phi(z\mid x)\log\left( \frac{p_\theta(x,z)}{q_\phi(z\mid x)} \cdot \frac{q_\phi(z\mid x)}{p_\theta(z\mid x)} \right)dz \\
&= \int q_\phi(z\mid x)\log\frac{p_\theta(x,z)}{q_\phi(z\mid x)}\,dz + \int q_\phi(z\mid x)\log\frac{q_\phi(z\mid x)}{p_\theta(z\mid x)}\,dz
\end{aligned}
$$

$$
\boxed{ \log p_\theta(x) = \mathcal L(\theta,\phi;x) + D_{KL}(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)) }
$$
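
The decomposition can be checked numerically on a model small enough that nothing is intractable. The sketch below assumes a hypothetical discrete latent $z \in \{0, 1, 2\}$ and a single fixed observation $x$; the probability tables are made up for illustration, and integrals become sums.

```python
import math

# Hypothetical toy model: three latent states and one fixed observation x.
prior = [0.5, 0.3, 0.2]           # p(z)
likelihood = [0.10, 0.40, 0.70]   # p(x | z) evaluated at the fixed x

joint = [p * l for p, l in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                               # p(x), tractable here by enumeration
posterior = [j / evidence for j in joint]           # true posterior p(z | x)

def elbo(q):
    """ELBO = E_q[log p(x, z) - log q(z)] for a discrete q."""
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q = [0.6, 0.3, 0.1]  # an arbitrary approximate posterior

print("log evidence      :", math.log(evidence))
print("ELBO + KL         :", elbo(q) + kl(q, posterior))
print("ELBO at q = p(z|x):", elbo(posterior))
```

For any valid `q` the first two printed values should agree, and the ELBO should reach the log evidence only when `q` is the true posterior, matching the equality condition derived from Jensen's inequality.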

Implications:

  • Evidence $\log p_\theta(x)$: represents how well the generative model $\theta$ explains the actual training data $x$, marginalizing (integrating) out all the hidden variables $z$. The goal of a likelihood-based generative model is to maximize this evidence. However, we cannot compute it directly because the integral over all possible $z$ is intractable.
  • KL divergence $D_{KL}(q_\phi(z|x) \,\Vert\, p_\theta(z|x))$: measures the approximation error between the approximate posterior $q_\phi(z|x)$ and the true, intractable posterior $p_\theta(z|x)$. It is always nonnegative, so it acts as the gap between the ELBO and the true log marginal likelihood. In practice we cannot compute it because $p_\theta(z|x)$ is intractable.
  • ELBO $\mathcal{L}(\theta,\phi;x)$: since the KL divergence is always $\ge 0$, we can rewrite the main equation as an inequality: $\log p_\theta(x) \ge \mathcal{L}(\theta,\phi;x)$ → the ELBO is a guaranteed lower bound on the evidence. The ELBO is the actual objective function we use to train the VAE. Unlike the marginal likelihood, the ELBO is tractable: it only involves $p_\theta(x,z)$ and $q_\phi(z|x)$, both of which we can evaluate, so it can be estimated by sampling $z \sim q_\phi(z|x)$. When we train a VAE, we maximize the ELBO with respect to both $\theta$ and $\phi$.
    • By maximizing the ELBO with respect to $\theta$, we improve the decoder’s ability to generate the data (pushing the ceiling up).
    • By maximizing the ELBO with respect to $\phi$, we minimize the KL divergence gap (pulling the floor closer to the ceiling), because $\log p_\theta(x)$ does not depend on $\phi$. This makes the encoder a better approximation of the true posterior.

Optimizing ELBO

$$
\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction term}} - \underbrace{D_{KL}(q_\phi(z|x) \,\Vert\, p(z))}_{\text{regularization term}}
$$
  • The reconstruction term measures how well the generative parameters $\theta$ reconstruct the original data $x$ given a latent variable $z$ sampled from the encoder (approximate posterior) $q_\phi(z|x)$.
  • The regularization term forces the approximate posterior to be close to the chosen prior $p(z)$. Optimizing this term makes the latent space smooth, organized, and easy to sample from, so that nearby latent points correspond to similar outputs. If we remove this term, one degenerate solution is for $q_\phi(z|x)$ to collapse to an almost deterministic point mass, i.e. a Gaussian with variance approaching zero. Then the model only learns to reconstruct each training example from its encoded latent code, without forcing those codes to match the prior $p(z)$. As a result, at test time, sampling $z \sim p(z)$ may produce latent vectors in regions the decoder was never trained on (dead areas), so generated samples are poor even though the reconstruction loss is low.
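
When $q_\phi(z|x)$ is a diagonal Gaussian $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and the prior is $\mathcal{N}(0, I)$, the regularization term has the standard closed form $\frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right)$. The sketch below, in plain Python, compares that formula with a Monte Carlo estimate. The function names and the example values of $\mu$ and $\sigma^2$ are assumptions for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    return 0.5 * sum(m * m + v - 1.0 - math.log(v) for m, v in zip(mu, var))

def kl_monte_carlo(mu, var, n=100_000, seed=0):
    """Estimate the same KL as the average of log q(z) - log p(z) over z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        for m, v in zip(mu, var):
            z = rng.gauss(m, math.sqrt(v))
            log_q = -0.5 * (math.log(2.0 * math.pi * v) + (z - m) ** 2 / v)
            log_p = -0.5 * (math.log(2.0 * math.pi) + z * z)
            total += log_q - log_p
    return total / n

mu = [0.5, -1.0]   # hypothetical encoder mean
var = [0.5, 2.0]   # hypothetical encoder variance

closed = kl_to_standard_normal(mu, var)
estimate = kl_monte_carlo(mu, var)
print("closed form:", closed, "Monte Carlo:", estimate)
print("KL when q equals the prior:", kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
```

The term is zero only when the encoder outputs exactly the prior, and it grows as $\sigma^2 \to 0$ because of the $-\log \sigma^2$ part, which is what penalizes the point-mass collapse described above.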

The Training Loop:

  1. Pass data $x$ through the encoder to output the parameters of $q_\phi(z|x)$ (typically a mean vector $\mu$ and a variance vector $\sigma^2$).

  2. Sample a latent vector $z$ from this distribution. To make this step differentiable so backpropagation works, we use the reparameterization trick:

    • In the original reconstruction term $\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$, the expectation is over $z \sim q_\phi(z|x)$. Since $z$ is sampled from a distribution that depends on $\phi$, we cannot backpropagate through the sampling operation to optimize $\phi$. The reparameterization trick rewrites the sample as

      $$
      z=\mu_\phi(x)+\sigma_\phi(x)\odot \epsilon,\qquad \epsilon\sim \mathcal{N}(0, I)
      $$

      so that $z$ becomes a differentiable function of $\phi$, with all the randomness carried by $\epsilon$:

      $$
      \mathbb E_{z\sim q_\phi(z\mid x)}[\log p_\theta(x\mid z)]=\mathbb E_{\epsilon\sim \mathcal N(0,I)}[\log p_\theta(x\mid \mu_\phi(x)+\sigma_\phi(x)\odot \epsilon)]
      $$

      Because the distribution of $\epsilon$ does not depend on $\phi$, we can move $\nabla_\phi$ inside the expectation and compute gradients by standard backpropagation.

  3. Pass $z$ through the decoder to get the reconstruction $p_\theta(x|z)$.

  4. Calculate the negative ELBO and backpropagate to update $\theta$ and $\phi$.
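
The forward half of this loop (steps 1 to 3 plus the loss in step 4) can be sketched in NumPy. Everything concrete here is an assumption for illustration: the linear encoder and decoder, the parameter names `W_mu`, `W_logvar`, `W_dec`, the unit-variance Gaussian likelihood, and the single-sample estimate of the reconstruction term. The backward pass is not shown; in practice it would be delegated to an automatic differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 4, 2

# Hypothetical toy parameters: a linear encoder (phi) and a linear decoder (theta).
phi = {
    "W_mu": rng.normal(scale=0.1, size=(z_dim, x_dim)),
    "W_logvar": rng.normal(scale=0.1, size=(z_dim, x_dim)),
}
theta = {"W_dec": rng.normal(scale=0.1, size=(x_dim, z_dim))}

def negative_elbo(x, phi, theta, eps):
    """Single-sample negative ELBO for one data point x, given noise eps ~ N(0, I)."""
    # Step 1: the encoder outputs the mean and log-variance of q_phi(z|x).
    mu = phi["W_mu"] @ x
    logvar = phi["W_logvar"] @ x
    # Step 2: reparameterized sample z = mu + sigma * eps.
    z = mu + np.exp(0.5 * logvar) * eps
    # Step 3: the decoder outputs the mean of p_theta(x|z); assumed N(x; x_hat, I).
    x_hat = theta["W_dec"] @ z
    log_lik = -0.5 * np.sum((x - x_hat) ** 2) - 0.5 * x.size * np.log(2.0 * np.pi)
    # Regularization term: closed-form KL( q_phi(z|x) || N(0, I) ).
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - 1.0 - logvar)
    # Step 4 (loss only): negative ELBO = -(reconstruction - KL).
    return -(log_lik - kl), kl

x = rng.normal(size=x_dim)         # a made-up data point
eps = rng.standard_normal(z_dim)   # noise drawn independently of phi
loss, kl = negative_elbo(x, phi, theta, eps)
print("negative ELBO:", loss, "KL term:", kl)
```

Once `eps` is drawn, the loss is an ordinary deterministic function of `phi` and `theta`, which is the property that lets backpropagation reach the encoder parameters.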