As a member of Bayesian methods research group I’m heavily interested in Bayesian approach to machine learning. One of the strengths of this approach is ability to work with hidden (unobserved) variables which are interpretable. This power however comes at a cost of generally intractable exact inference, which limits the scope of solvable problems.

Another topic which gained lots of momentum in Machine Learning recently is Deep Learning, of course. With Deep Learning we can now build big and complex models that outperform most hand-engineered approaches given lots of data and computational power. The fact that Deep Learning needs a considerable amount of data also requires these methods to be scalable — a really nice property for any algorithm to have, especially in a Big Data epoch.

Given how appealing both topics are it’s not a surprise there’s been some work to marry these two recently. In this series of blogsposts I’d like to summarize recent advances, particularly in variational inference. This is not meant to be an introductory discussion as prior familiarity with classical topics (Latent variable models, Variational Inference, Mean-field approximation) is required, though I’ll introduce these ideas anyway just to remind it and setup the notation.

### Latent Variables Models

Suppose you have a probabilistic model that’s easy to describe using some auxiliary variables \(Z\) that you don’t observe directly (or even would like to infer it given the data). One classical example for this setup is Gaussian Mixture Modeling: we have \(K\) components in a mixture, and \(z_n\) is a one-hot vector of dimensionality \(K\) indicating which component an observation \(x_n\) belongs to. Then, conditioned on \(z_n\) the distribution of \(x_n\) is a usual Gaussian distribution: \(p(x_{n} \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\) (here whenever I refer to a distribution, you should read it as its density. At least generalized one). Therefore the joint distribution of the model is

\[ p(x, z \mid \Theta) = \prod_{n=1}^N \prod_{k=1}^K \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}} \pi_k^{z_{nk}} \]

Where \(\pi\) is a probability distribution over \(K\) outcomes, and \(\Theta\) is a set of all model’s parameters (\(\pi\), \(\mu\)s and \(\Sigma\)s).

We’d like to do 2 things with the model: first, we obviously need to learn parameters \(\Theta\), and second, we’d like infer these latent variables \(z_n\) to know which cluster the observation \(x_n\) belongs to, that is, we need to calculate the distribution of \(z_n\) conditioned on \(x_n\): \(p(z_n \mid x_n)\).

We want to learn the parameters \(\Theta\) as usual by maximizing the log-likelihood. Unfortunately, we don’t know true assignments \(z_n\), and marginalizing over it as in \(p(x_n) = \sum_{k=1}^K \pi_k p(x_n, z_{nk} = 1)\) is not a good idea as the resulting optimization problem would lose its convexity. Instead we decompose the log-likelihood as follows:

\[ \begin{align} \log p(x) &= \mathbb{E}_{q(z\mid x)} \overbrace{\log p(x)}^{\text{const in $z$}} = \mathbb{E}_{q(z\mid x)} \log \frac{p(x, z) q(z\mid x)}{p(z \mid x) q(z\mid x)} \\ &= \mathbb{E}_{q(z\mid x)} \log \frac{p(x, z)}{q(z\mid x)} + D_{KL}(q(z\mid x) \mid\mid p(z \mid x)) \end{align} \]

The second term is a Kullback-Leibler divergence, which is always non-negative, and equals zero iff distributions are equal almost everywhere \(q(z\mid x) = p(z \mid x)\). Therefore putting \(q(z \mid x) = p(z \mid x)\) eliminates the second term, leaving us with \(\log p(x) = \mathbb{E}_{p(z \mid x)} \log \frac{p(x, z)}{p(z \mid x)}\). Therefore all we need to be able to do is to calculate the posterior \(p(z \mid x)\), and maximize the expectation. This is how EM algorithm is derived: at E-step we calculate the posterior \(p(z \mid x, \Theta^{\text{old}})\), and at M-step we maximize the expectation \(\mathbb{E}_{p(z \mid x, \Theta^{\text{old}})} \log \frac{p(x, z \mid \Theta)}{p(z \mid x, \Theta)}\) with respect to \(\Theta\) keeping \(\Theta^{\text{old}}\) fixed.

Now, all we are left to do is to find the posterior \(p(z \mid x)\) which is given by the following deceivingly simple formula knows as a Bayes’ rule.

\[ p(z \mid x) = \frac{p(x \mid z) p(z)}{\int p(x \mid z) p(z)dz} \]

Of course, there’s no free lunch and computing the denominator is intractable in general case. One **can** compute the posterior exactly when the prior \(p(z)\) and the likelihood \(p(x \mid z)\) are conjugate (that is, after multiplying the prior by the likelihood you get the same functional form for \(z\) as in the prior, thus the posterior comes from the same family as the prior but with different parameters), but many models of practical interest don’t have this property. This is where variational inference comes in.

### Variational Inference and Mean-field

In a variational inference (VI) framework we approximate the true posterior \(p(z \mid x)\) with some other simpler distribution \(q(z \mid x, \Lambda)\) where \(\Lambda\) is a set of (variational) parameters of the (variational) approximation (I may omit \(\Lambda\) and \(\Theta\) in a “given” clause when it’s convenient, remember, it always could be placed there). One possibility is to divide latent variables \(z\) in groups and force the groups to be independent. In extreme case each variable gets its own group, assuming independence among all variables \(q(z \mid x) = \prod_{d=1}^D q(z_d \mid x)\). If we then set about to find the best approximation to the true posterior in this fully factorized class, we will no longer have the optimal \(q\) being the true posterior itself, as the true posterior is presumably too complicated to be dealt with in analytic form (which we want from the approximation \(q\) when we say “simpler distribution”). Therefore we find the optimal \(q(z_i)\) by minimizing the KL-divergence with the true posterior (\(\text{const}\) denotes terms that are constant w.r.t. \(q(z_i)\)):

\[ \begin{align} D_{KL}(q(z \mid x) \mid\mid p(z \mid x)) &= \mathbb{E}_{q(z_i \mid x)} \left[ \mathbb{E}_{q(z_{-i} \mid x)} \log \frac{q(z_1 \mid x) \dots q(z_D \mid x)}{p(z \mid x)} \right] \\ &= \mathbb{E}_{q(z_i \mid x)} \left[ \log q(z_i \mid x) - \underbrace{\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)}_{\log f(z_i \mid x)} \right] + \text{const} \\ &= \mathbb{E}_{q(z_i \mid x)} \left[ \log \frac{q(z_i \mid x)}{\tfrac{1}{Z} f(z_i \mid x)} \right] - \log Z + \text{const} \\ &= D_{KL}\left(q(z_i \mid x) \mid\mid \tfrac{1}{Z} f(z_i \mid x)\right) + \text{const} \end{align} \]

For many models it’s possible to look into \(\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)\) and immediately recognize logarithm of unnormalized density of some known distribution.

Another cornerstone of this framework is a notion of **Evidence Lower Bound** (ELBO): recall the decomposition of log-likelihood we derived above. In our current setting we can not compute the right hand side as we can not evaluate the true posterior \(p(z \mid x)\). However, note that the left hand side (that is, the log-likelihood) does not depend on the variational distribution \(q(z \mid x, \Lambda)\). Therefore, maximizing the first term of the right hand side w.r.t. variational parameters \(\Lambda\) results in minimizing the second term, the KL-divergence with the true posterior. This implies we can ditch the second term, and maximize the first one w.r.t. both model parameters \(\Theta\) and variational parameters \(\Lambda\):

\[ \text{ELBO:} \quad \mathcal{L}(\Theta, \Lambda) := \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \]

Okay, so this covers the basics, but before we go to the neural networks-based methods we need to discuss some general approaches to VI and how to make it scalable. This is what the next blog post is all about.