The Evidence Lower Bound (ELBO)

Given a training set $\{x_1, \ldots, x_N\}$ that follows an unknown distribution $\mu_X$, we aim to fit a model $p_\theta(x, z)$ to it by maximizing the likelihood of the data. In summary:

  • If we do not have an analytical form of the marginal $p_\theta(x_n)$ but only the expression of $p_\theta(x_n, z)$, we can get an estimate of the marginal by sampling $z$ from any distribution $q$: $$ p_\theta(x_n) = \int p_\theta(x_n, z) \, dz = \int \frac{p_\theta(x_n, z)}{q(z)} q(z) \, dz = \mathbb{E}_{Z \sim q(z)} \left[ \frac{p_\theta(x_n, Z)}{q(Z)} \right] $$

The Evidence Lower Bound (ELBO):

  • Since this estimator is unbiased, a first idea is to maximize the sampled ratio $$ \frac{p_\theta(x_n, Z)}{q(Z)} $$ on average over $Z \sim q$.

  • But we want to maximize $\log p_\theta(x_n)$. A natural solution is to maximize the log of this ratio instead: $$ \log \left(\frac{p_\theta(x_n, Z)}{q(Z)}\right) $$ Since the KL divergence is positive, the log of our estimator is, on average, a lower bound of the quantity we want to estimate.

  • However, this does not maximize $\log p_\theta(x_n)$ itself but a lower bound of it, since the KL divergence is positive. Maximizing this bound pushes the KL term down, which also aligns $p_\theta(z|x_n)$ and $q(z)$; we may therefore end up with a worse $p_\theta(x_n)$ just to bring $p_\theta(z|x_n)$ closer to $q(z)$.

  • This analysis remains valid if $q$ is a parameterized function $q_\alpha(z|x_n)$ of $x_n$. In that case, optimizing both parameters $\theta$ and $\alpha$ to maximize the bound both maximizes $\log p_\theta(x_n)$ and brings $q_\alpha(z|x_n)$ closer to $p_\theta(z|x_n)$.

Given a training set $\{x_1, \ldots, x_N\}$ that follows an unknown distribution $\mu_X$, we aim to fit a model $p_\theta(x, z)$ to this data, maximizing the likelihood of the observed data under the model.

Estimating the Marginal Distribution

In scenarios where we do not have an analytical form of the marginal probability $p_\theta(x_n)$ but only have the joint probability $p_\theta(x_n, z)$, we need a method to estimate the marginal probability. This can be achieved using the following technique:

$$ p_\theta(x_n) = \int p_\theta(x_n, z) \, dz $$

However, directly computing this integral may not be feasible if the distribution over $z$ is complex. Instead, we can use an importance sampling approach where $z$ is sampled from an arbitrary distribution $q(z)$, known as the proposal distribution. This leads to:

$$ p_\theta(x_n) = \int \frac{p_\theta(x_n, z)}{q(z)} q(z) \, dz = \mathbb{E}_{Z \sim q(z)} \left[ \frac{p_\theta(x_n, Z)}{q(Z)} \right] $$

Here, the integral is rewritten as an expectation under $q(z)$, which can then be approximated by averaging the ratio over samples $Z \sim q(z)$.
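As a sanity check, this estimator can be tried on a toy model where the exact marginal is known in closed form. The model below (latent $z \sim \mathcal{N}(0,1)$, observation $x \mid z \sim \mathcal{N}(z,1)$, so $x \sim \mathcal{N}(0,2)$) and the proposal $q(z) = \mathcal{N}(0,2)$ are illustrative assumptions, not part of the original derivation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an illustrative assumption, not from the text):
# z ~ N(0, 1), x | z ~ N(z, 1), so the exact marginal is x ~ N(0, 2).
def log_joint(x, z):
    log_prior = -0.5 * (z**2 + np.log(2 * np.pi))
    log_lik = -0.5 * ((x - z) ** 2 + np.log(2 * np.pi))
    return log_prior + log_lik

x_n = 1.3
n_samples = 200_000

# Proposal q(z) = N(0, 2): any density covering the support of z works.
z = rng.normal(0.0, np.sqrt(2.0), size=n_samples)
log_q = -0.5 * (z**2 / 2.0 + np.log(2 * np.pi * 2.0))

# p(x_n) ~= (1/S) * sum_s p(x_n, z_s) / q(z_s), with z_s ~ q
estimate = np.exp(log_joint(x_n, z) - log_q).mean()

# Exact marginal density N(x_n; 0, 2) for comparison.
exact = np.exp(-0.5 * (x_n**2 / 2.0 + np.log(2 * np.pi * 2.0)))
print(estimate, exact)  # the two values should be close
```

The estimate is unbiased for any proposal whose support covers that of $z$; the choice of $q$ only affects the variance of the estimator.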

The Evidence Lower Bound (ELBO)

To maximize the likelihood of our model parameters $\theta$ with respect to the observed data, we consider maximizing the log likelihood:

$$ \log p_\theta(x_n) = \log \left( \int p_\theta(x_n, z) \, dz \right) $$

Using Jensen’s inequality, we can introduce a lower bound (ELBO) on the log likelihood, which is computationally more tractable:

$$ \log p_\theta(x_n) \geq \mathbb{E}_{Z \sim q(z)} \left[ \log \left(\frac{p_\theta(x_n, Z)}{q(Z)}\right) \right] = \text{ELBO} $$

This inequality arises because the logarithm is a concave function: by Jensen's inequality, the expectation of the log is a lower bound on the log of the expectation.
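Both sides of the bound can be evaluated numerically on a toy Gaussian model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$, an illustrative assumption) where $\log p_\theta(x_n) = \log \mathcal{N}(x_n; 0, 2)$ is known exactly; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (an illustrative assumption): z ~ N(0, 1), x | z ~ N(z, 1),
# so log p(x) = log N(x; 0, 2) is known exactly and can be compared to the ELBO.
x_n = 1.3
z = rng.normal(0.0, 1.0, size=200_000)  # proposal q(z) = N(0, 1)

log_joint = (-0.5 * (z**2 + np.log(2 * np.pi))
             - 0.5 * ((x_n - z) ** 2 + np.log(2 * np.pi)))
log_q = -0.5 * (z**2 + np.log(2 * np.pi))

elbo = np.mean(log_joint - log_q)  # E_q[log(p(x_n, Z) / q(Z))]
log_px = -0.5 * (x_n**2 / 2.0 + np.log(2 * np.pi * 2.0))
print(elbo, log_px)  # elbo is strictly below log_px here
```

The gap between the two printed values is exactly the KL divergence between $q(z)$ and the true posterior, which motivates the decomposition discussed next.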

Importance of KL Divergence

The ELBO can also be expressed in terms of the Kullback-Leibler divergence between the proposal distribution $q(z)$ and the true posterior $p_\theta(z|x_n)$:

$$ \text{ELBO} = \mathbb{E}_{Z \sim q(z)} \left[ \log p_\theta(x_n, Z) - \log q(Z) \right] = \log p_\theta(x_n) - D_{KL}(q(z) \,\|\, p_\theta(z|x_n)) $$

Maximizing the ELBO effectively minimizes the KL divergence, aligning the proposal distribution $q(z)$ more closely with the true posterior, which improves the approximation of the marginal likelihood.
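This decomposition can be checked term by term on a toy Gaussian model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$, an illustrative assumption), whose posterior $p(z \mid x) = \mathcal{N}(x/2, 1/2)$ and log-marginal are available in closed form; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model (an illustrative assumption): z ~ N(0, 1), x | z ~ N(z, 1).
# Its posterior is Gaussian in closed form: p(z | x) = N(x / 2, 1 / 2).
x_n = 1.3
mu_post, var_post = x_n / 2.0, 0.5

# Proposal q(z) = N(0, 1); closed-form KL(q || posterior) between Gaussians.
mu_q, var_q = 0.0, 1.0
kl = (0.5 * np.log(var_post / var_q)
      + (var_q + (mu_q - mu_post) ** 2) / (2.0 * var_post) - 0.5)

# Monte Carlo ELBO and the exact log-marginal log N(x_n; 0, 2).
z = rng.normal(mu_q, np.sqrt(var_q), size=200_000)
log_joint = (-0.5 * (z**2 + np.log(2 * np.pi))
             - 0.5 * ((x_n - z) ** 2 + np.log(2 * np.pi)))
log_q = -0.5 * ((z - mu_q) ** 2 / var_q + np.log(2 * np.pi * var_q))
elbo = np.mean(log_joint - log_q)
log_px = -0.5 * (x_n**2 / 2.0 + np.log(2 * np.pi * 2.0))

print(elbo, log_px - kl)  # equal up to Monte Carlo error
```

The Monte Carlo ELBO matches $\log p_\theta(x_n) - D_{KL}(q \,\|\, p_\theta(z|x_n))$ up to sampling error, confirming the identity.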

Parameterization of $q$

When $q$ is parameterized as $q_\alpha(z|x_n)$, we optimize not only the model parameters $\theta$ but also the parameters $\alpha$ of the proposal distribution. This dual optimization helps in maximizing $\log p_\theta(x_n)$ and reducing the discrepancy between $q_\alpha(z|x_n)$ and $p_\theta(z|x_n)$.

By maximizing the ELBO, we approximate the true log likelihood as closely as possible, thus improving our model fit to the data.
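As a sketch of this joint view, one can fix the generative model (the toy Gaussian $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$, an illustrative assumption with true posterior $\mathcal{N}(x/2, 1/2)$) and fit a proposal reduced to one hypothetical parameter, $q_\alpha(z \mid x) = \mathcal{N}(\alpha x, 1/2)$, by maximizing a Monte Carlo ELBO over $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model (an illustrative assumption): z ~ N(0, 1), x | z ~ N(z, 1),
# whose true posterior is N(x / 2, 1 / 2).  The "encoder" is reduced to a
# single hypothetical parameter alpha: q_alpha(z | x) = N(alpha * x, 1 / 2).
x_n = 1.3

def elbo(alpha, n=100_000):
    # Sample Z ~ q_alpha(z | x_n) and average log(p(x_n, Z) / q_alpha(Z | x_n)).
    z = alpha * x_n + np.sqrt(0.5) * rng.standard_normal(n)
    log_joint = (-0.5 * (z**2 + np.log(2 * np.pi))
                 - 0.5 * ((x_n - z) ** 2 + np.log(2 * np.pi)))
    log_q = -0.5 * ((z - alpha * x_n) ** 2 / 0.5 + np.log(2 * np.pi * 0.5))
    return np.mean(log_joint - log_q)

# Crude grid search over alpha; a VAE would instead take stochastic gradient
# steps on theta and alpha jointly.
alphas = np.linspace(0.0, 1.0, 21)
best = max(alphas, key=elbo)
print(best)  # close to 0.5, where q_alpha matches the posterior mean x_n / 2
```

The ELBO peaks near $\alpha = 0.5$, where $q_\alpha(z|x_n)$ coincides with the true posterior; in a VAE, $\alpha$ would be the weights of an encoder network.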


