ML notes
PCA
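We want to find the unit vector $w$ that maximizes the variance of the data after projection onto $w$:
$$\max_{w} \; w^\top \Sigma w \quad \text{subject to } \|w\|_2 = 1$$
where $\Sigma$ is the covariance matrix of the data. The maximizer is the eigenvector of $\Sigma$ with the largest eigenvalue.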
Inference:
- step 0: standardize the data (skip standardization only when you’re confident scales are comparable and variance magnitude is meaningful)
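- step 1: estimate the mean $\mu$ and compute the covariance matrix $\Sigma$ from the data
- step 2: compute the eigenvalues and eigenvectors of $\Sigma$ (solve $\det(\Sigma - \lambda I) = 0$, which is a degree-$n$ polynomial in $\lambda$, where $n$ is the number of features)
- step 3: project the centered data onto the picked eigenvectors (those with the largest eigenvalues)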
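The steps above can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the function name `pca`, its `standardize` flag, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca(X, k, standardize=True):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    # Steps 0-1: center each feature, and optionally scale it to unit variance.
    Xc = X - X.mean(axis=0)
    if standardize:
        std = Xc.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # leave constant features unscaled
        Xc = Xc / std
    # Step 1: sample covariance matrix, shape (n_features, n_features).
    cov = np.atleast_2d(np.cov(Xc, rowvar=False))
    # Step 2: eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest eigenvalues
    W = eigvecs[:, order]                  # columns are the picked eigenvectors
    # Step 3: project the preprocessed data onto the picked eigenvectors.
    return Xc @ W, eigvals[order], W

# Synthetic data (illustrative): three features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
Z, top_eigvals, W = pca(X, k=2, standardize=False)
```

With `standardize=False`, the sample variance of each projected column equals the corresponding eigenvalue, which is a quick way to check that the projection matches the variance-maximization view.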
EM algorithm
VAE
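Given a dataset $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$, we want to model the underlying distribution $p(x)$. A latent variable model assumes each observation $x$ is generated from a lower-dimensional latent code $z$: first sample $z \sim p(z)$, then generate $x \sim p_\theta(x \mid z)$. Here $p(z)$ is a fixed prior (typically $\mathcal{N}(0, I)$) and $p_\theta(x \mid z)$ is a learned conditional distribution parameterized by $\theta$. For our training objective, we wish to maximize the marginal log-likelihood (evidence):
$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$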
True posterior and evidence are both intractable
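By Bayes' theorem:
$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)} = \frac{p_\theta(x \mid z)\, p(z)}{\int p_\theta(x \mid z')\, p(z')\, dz'}$$
The numerator is entirely tractable. The intractability comes entirely from the denominator (the marginal likelihood $p_\theta(x)$), because: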
- There is no closed-form solution. In a VAE, $p_\theta(x \mid z)$ is parameterized by DNNs, so mapping $z$ to $x$ involves passing variables through layers of matmuls and nonlinear activations. Because of these nonlinearities, the integrand is too complex for us to write down a closed-form solution to the integral.
- Numerical intractability. Since the analytic route fails, one fallback is numerical integration: estimating the integral with, for example, a Riemann sum. However, latent spaces are high-dimensional. If we evaluate the integrand at just $k$ points along each of the $d$ latent dimensions, we need $k^d$ evaluations of the neural network for a single data sample.
Because the denominator cannot be computed analytically or by grid-based numerical integration, the true posterior $p_\theta(z \mid x)$ is intractable, and we use variational inference to approximate it.
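To make the scaling argument concrete, here is a small sketch. It assumes a toy one-dimensional model (prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(x; z, 1)$) in which the evidence is known to be $\mathcal{N}(x; 0, 2)$, so a Riemann sum can be checked against the exact value. The function names and the choice of 10 points per dimension are illustrative assumptions, not values from these notes.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a univariate Gaussian N(mean, var) at v."""
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def riemann_evidence_1d(x, points=2001, lo=-10.0, hi=10.0):
    """Midpoint Riemann sum for p(x) = integral of N(x; z, 1) N(z; 0, 1) dz."""
    dz = (hi - lo) / points
    total = 0.0
    for i in range(points):
        z = lo + (i + 0.5) * dz
        total += normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) * dz
    return total

def grid_evaluations(points_per_dim, latent_dim):
    """Number of integrand evaluations for a full grid over the latent space."""
    return points_per_dim ** latent_dim

approx = riemann_evidence_1d(1.0)
exact = normal_pdf(1.0, 0.0, 2.0)  # closed form for this toy model only

# Cost of the same grid strategy as the latent dimension grows.
costs = {d: grid_evaluations(10, d) for d in (1, 2, 10, 32)}
```

In one dimension the sum matches the exact evidence closely, but the grid cost grows exponentially with the latent dimension. Each evaluation would also be a full decoder forward pass rather than a cheap Gaussian density.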
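ELBO (evidence lower bound) of the marginal log-likelihood:
$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] =: \mathcal{L}(\theta, \phi; x)$$
where $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ and the inequality is Jensen's inequality applied to the concave $\log$.
Notation:
- $\theta$: generative parameters (decoder). This model represents the assumption about how the world works: it takes the latent variable $z$ and translates it into observable data / evidence $x$. The generative model therefore consists of the likelihood $p_\theta(x \mid z)$ and the prior $p(z)$ (the prior is fixed here, so $\theta$ only parameterizes the likelihood).
- $\phi$: variational parameters (encoder). Because the true posterior $p_\theta(z \mid x)$ is intractable (explained above), we use an approximate posterior to stand in for it. $\phi$ defines this approximate posterior $q_\phi(z \mid x)$.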
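Since $\log$ is strictly concave, equality in Jensen's inequality holds iff $\frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is constant with respect to the expectation ($z \sim q_\phi(z \mid x)$). That is:
$$\frac{p_\theta(x, z)}{q_\phi(z \mid x)} = c \quad \text{for all } z \text{ with } q_\phi(z \mid x) > 0$$
Integrating $p_\theta(x, z) = c\, q_\phi(z \mid x)$ over $z$ gives $c = p_\theta(x)$, so the bound is tight exactly when the approximate posterior $q_\phi(z \mid x)$ equals the true posterior $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$.
Exact decomposition of evidence into ELBO + KL:
$$\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{ELBO } \mathcal{L}(\theta, \phi; x)} + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$$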
Implications:
- Evidence $\log p_\theta(x)$: represents how well the generative model explains the actual training data $x$, marginalizing (integrating) out all the hidden variables $z$. The ultimate goal of any generative model is to maximize this evidence. However, we cannot compute it directly because the integral over all possible $z$ is intractable.
- KL divergence $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$: measures the approximation error between the approximate posterior $q_\phi(z \mid x)$ and the true, intractable posterior $p_\theta(z \mid x)$. It is always nonnegative and is exactly the gap between the ELBO and the log evidence. We cannot compute it because $p_\theta(z \mid x)$ is intractable.
- ELBO $\mathcal{L}(\theta, \phi; x)$: since the KL divergence is always $\ge 0$, we can rewrite the main equation as an inequality: $\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$, so the ELBO is a guaranteed lower bound on the evidence. The ELBO is the actual objective function we use to train the VAE. Unlike the marginal likelihood, the ELBO can be estimated and optimized with samples from $q_\phi(z \mid x)$. When we train a VAE, we maximize the ELBO with respect to both $\theta$ and $\phi$.
- By maximizing the ELBO with respect to $\theta$, we are improving our decoder's ability to generate the data (pushing the ceiling up).
- By maximizing the ELBO with respect to $\phi$, we are minimizing the KL divergence gap (pulling the floor closer to the ceiling), making our encoder a better approximation of the true posterior. This follows because $\log p_\theta(x)$ does not depend on $\phi$, so any increase in the ELBO from changing $\phi$ must come from an equal decrease in the KL term.
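The evidence = ELBO + KL decomposition can be checked numerically in a model where everything is tractable. The sketch below assumes a toy one-dimensional linear-Gaussian model that is not part of these notes: prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(x; z, 1)$, and a Gaussian approximate posterior $\mathcal{N}(m, s^2)$. In this model the evidence is $\mathcal{N}(x; 0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. All function names are illustrative.

```python
import math

def log_evidence(x):
    """log p(x) for the toy model, where p(x) = N(x; 0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s2):
    """Closed-form ELBO for q(z | x) = N(m, s2) in the toy model."""
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)  # E_q[log p(x|z)]
    prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m * m + s2)         # E_q[log p(z)]
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)               # -E_q[log q(z|x)]
    return recon + prior + entropy

def kl_to_true_posterior(x, m, s2):
    """KL(N(m, s2) || N(x/2, 1/2)), the gap between evidence and ELBO."""
    mp, vp = x / 2.0, 0.5
    return 0.5 * (math.log(vp / s2) + (s2 + (m - mp) ** 2) / vp - 1.0)

x = 1.0
# An arbitrary, imperfect approximate posterior.
gap_check = log_evidence(x) - (elbo(x, m=0.2, s2=0.9) + kl_to_true_posterior(x, 0.2, 0.9))
# The exact posterior: the KL term vanishes and the bound is tight.
tight_elbo = elbo(x, m=0.5, s2=0.5)
```

For any choice of `m` and `s2` the ELBO stays at or below the log evidence, and it reaches the evidence only when `q` matches the true posterior.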
Optimizing ELBO
$$\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{regularization}}$$
- The reconstruction term $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ measures how well the generative parameters $\theta$ reconstruct the original data $x$ given a latent variable $z$ sampled from the encoder (approximate posterior) $q_\phi(z \mid x)$.
- The regularization term $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$ forces the predicted posterior $q_\phi(z \mid x)$ to be close to the chosen prior $p(z)$. Optimizing this term makes the latent space smooth, organized, and easy to sample from, so that nearby latent points correspond to similar outputs. If we remove this term, one degenerate solution is for $q_\phi(z \mid x)$ to collapse to an almost deterministic point mass, i.e. a Gaussian with variance approaching zero. Then the model only learns to reconstruct each training example from its encoded latent code, without forcing those codes to match the prior $p(z)$. As a result, at test time, sampling $z \sim p(z)$ may produce latent vectors in regions the decoder was never trained on (dead areas), so generated samples are poor even though reconstruction loss is low.
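For a diagonal Gaussian encoder and a standard normal prior, the regularization term has the closed form $\tfrac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$. The sketch below assumes that setting and parameterizes the variance by its logarithm, which is a common convention but an assumption here. The function name is illustrative.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    logvar = np.asarray(logvar, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)

# q equal to the prior: no penalty.
kl_match = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifted mean, unit variance: penalty of 0.5 * ||mu||^2.
kl_shift = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0])
# Variance shrinking toward zero: the penalty grows without bound.
kl_narrow = kl_diag_gaussian_to_standard_normal([0.0], [-2.0])
kl_narrower = kl_diag_gaussian_to_standard_normal([0.0], [-20.0])
```

The last two values show why this term discourages the point-mass collapse described above: the $-\log \sigma^2$ part grows as the variance approaches zero.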
The Training Loop:
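- Pass data $x$ through the encoder to output the parameters of $q_\phi(z \mid x)$ (typically a mean vector $\mu_\phi(x)$ and a variance vector $\sigma_\phi^2(x)$).
- Sample a latent vector $z$ from this distribution. To make this step differentiable so backpropagation works, we use the reparameterization trick:
  $$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
  In the original reconstruction term $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$, the expectation is over $z \sim q_\phi(z \mid x)$, and since $z$ is sampled from a distribution that depends on $\phi$, it is not possible to backpropagate through the sampling operation to optimize $\phi$. The reparameterization trick rewrites the sample as $z = g_\phi(x, \epsilon) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$, so $z$ becomes a differentiable function of $\phi$.
  And because the distribution of $\epsilon$ does not depend on $\phi$, we can move $\nabla_\phi$ inside the expectation and compute gradients by standard backpropagation:
  $$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[\nabla_\phi \log p_\theta\!\left(x \mid g_\phi(x, \epsilon)\right)\right]$$
- Pass $z$ through the decoder to get the reconstruction $\hat{x}$.
- Calculate the negative ELBO and backpropagate to update $\theta$ and $\phi$.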
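The forward half of this loop can be sketched in NumPy as below. This is a sketch under strong simplifying assumptions that are not from these notes: the encoder and decoder are single linear maps with hypothetical weight names, the decoder is a unit-variance Gaussian (so the reconstruction loss is a squared error up to an additive constant), and one noise sample is drawn per data point. The backward pass is omitted; in practice an autodiff framework would differentiate this loss.

```python
import numpy as np

def init_params(x_dim, z_dim, rng):
    # Hypothetical linear encoder/decoder weights, for illustration only.
    return {
        "W_mu": 0.1 * rng.normal(size=(x_dim, z_dim)),
        "W_logvar": 0.1 * rng.normal(size=(x_dim, z_dim)),
        "W_dec": 0.1 * rng.normal(size=(z_dim, x_dim)),
    }

def negative_elbo(x, params, eps):
    """One-sample Monte Carlo estimate of the negative ELBO, averaged over the batch."""
    # 1. Encoder: mean and log-variance of q(z | x).
    mu = x @ params["W_mu"]
    logvar = x @ params["W_logvar"]
    # 2. Reparameterization: z is a deterministic function of (mu, logvar, eps).
    z = mu + np.exp(0.5 * logvar) * eps
    # 3. Decoder: reconstruction of x from z.
    x_hat = z @ params["W_dec"]
    # 4. Negative ELBO = reconstruction loss + KL(q(z | x) || N(0, I)).
    #    Assumed unit-variance Gaussian decoder; the additive constant is dropped.
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=-1)
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)
    return np.mean(recon + kl), z, x_hat

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                # a batch of 8 synthetic data points
params = init_params(x_dim=5, z_dim=2, rng=rng)
eps = rng.standard_normal(size=(8, 2))     # noise drawn independently of the parameters
loss, z, x_hat = negative_elbo(x, params, eps)
```

Passing `eps` in as an argument keeps the randomness outside the function, so the loss is a deterministic, differentiable function of the parameters. That is the property the reparameterization trick is meant to provide.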