# Neural Variational Inference: Variational Autoencoders and Helmholtz machines

So far we’ve had very little “neural” in our VI methods. It’s time to fix that: this post covers Variational Autoencoders (VAE), introduced in a paper by D. Kingma and M. Welling that made a lot of buzz in the ML community. The paper has two main contributions: a new approach (AEVB) to large-scale inference in non-conjugate models with continuous latent variables, and a probabilistic model of autoencoders as an example of this approach. We then discuss connections to Helmholtz machines, a predecessor of VAEs.

### Auto-Encoding Variational Bayes

As noted in the introduction of the post, this approach, called Auto-Encoding Variational Bayes (AEVB), works only for some models with continuous latent variables. Recall from our discussion of Blackbox VI and Stochastic VI that we’re interested in maximizing the ELBO $$\mathcal{L}(\Theta, \Lambda)$$:

$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}$

It’s not a problem to compute an estimate of the gradient of the ELBO w.r.t. the model parameters $$\Theta$$, but estimating the gradient w.r.t. the approximation parameters $$\Lambda$$ is tricky, because these parameters influence the distribution the expectation is taken over. As we know from the post on Blackbox VI, the naive gradient estimator based on the score function exhibits high variance.

It turns out that for some distributions we can make a change of variables: a sample $$z \sim q(z \mid x, \Lambda)$$ can be represented as a (differentiable) transformation $$g(\varepsilon; \Lambda, x)$$ of some auxiliary random variable $$\varepsilon$$ whose distribution does not depend on $$\Lambda$$. A well-known example of such a reparametrization is the Gaussian distribution: if $$z \sim \mathcal{N}(\mu, \Sigma)$$ then $$z$$ can be represented as $$z = g(\varepsilon; \mu, \Sigma) := \mu + \Sigma^{1/2} \varepsilon$$ for $$\varepsilon \sim \mathcal{N}(0, I)$$. This transformation is called the reparametrization trick. After the reparametrization the ELBO becomes

\begin{align} \mathcal{L}(\Theta, \Lambda) &= \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \log \frac{p(x, g(\varepsilon; \Lambda, x)\mid \Theta)}{q(g(\varepsilon; \Lambda, x) \mid \Lambda, x)} \\ &\approx \frac{1}{L} \sum_{l=1}^L \log \frac{p(x, g(\varepsilon^{(l)}; \Lambda, x)\mid \Theta)}{q(g(\varepsilon^{(l)}; \Lambda, x) \mid \Lambda, x)} \quad \quad \text{where } \varepsilon^{(l)} \sim \mathcal{N}(0, I) \end{align}

This objective is much better, as we no longer need to differentiate through the distribution the expectation is taken over, which essentially puts the variational parameters $$\Lambda$$ in the same regime as the model parameters $$\Theta$$. It’s now sufficient to take gradients of the ELBO estimate and run any stochastic optimization algorithm such as Adam.
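To get a feeling for the difference between the two estimators, here is a minimal sketch on a toy problem that is not from the paper: the objective $$F(\mu) = \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)} z^2$$, whose exact gradient is $$2\mu$$. The function names and the choice of $$f(z) = z^2$$ are assumptions made purely for illustration.

```python
import random
import statistics

# Toy objective: F(mu) = E_{z ~ N(mu, 1)}[z^2] = mu^2 + 1, so dF/dmu = 2 * mu.

def score_function_grad(mu, rng):
    """One-sample score-function estimate: f(z) * d/dmu log N(z | mu, 1)."""
    z = rng.gauss(mu, 1.0)
    return z ** 2 * (z - mu)

def reparam_grad(mu, rng):
    """One-sample reparametrized estimate: d/dmu f(mu + eps), eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return 2.0 * (mu + eps)

rng = random.Random(0)
mu, n = 1.5, 20_000
score = [score_function_grad(mu, rng) for _ in range(n)]
reparam = [reparam_grad(mu, rng) for _ in range(n)]

print("true gradient:", 2 * mu)
print("score-function: mean", statistics.fmean(score),
      "variance", statistics.variance(score))
print("reparametrized: mean", statistics.fmean(reparam),
      "variance", statistics.variance(reparam))
```

Both estimators are unbiased, so their means should agree with $$2\mu$$, but in this toy setting the per-sample variance of the score-function estimator is an order of magnitude larger.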

Oh, and if you wonder what Auto-Encoding in Auto-Encoding Variational Bayes means, there’s an interesting interpretation of the ELBO in terms of autoencoding:

\begin{align} \mathcal{L}(\Theta, \Lambda) & = \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} = \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x \mid z, \Theta) p(z \mid \Theta)}{q(z \mid x, \Lambda)} \\ & = \mathbb{E}_{q(z\mid x, \Lambda)} \log p(x \mid z, \Theta) - D_{KL}(q(z \mid x, \Lambda) \mid\mid p(z \mid \Theta)) \end{align}

Here the first term can be treated as the negative expected reconstruction error (of $$x$$ from the code $$z$$), while the second one acts as a regularizer that keeps the approximation close to the prior.

### Variational Autoencoder

One particular application of the AEVB framework comes from using neural networks as the model $$p(x \mid z, \Theta)$$ (called the generative network) and the approximation $$q(z \mid x, \Lambda)$$ (called the inference network or recognition network). The model places no requirements on $$x$$, which can be discrete or continuous (or mixed). $$z$$, however, has to be continuous. Moreover, we need to be able to apply the reparametrization trick. Therefore in many practical applications $$q(z \mid x, \Lambda)$$ is set to be a Gaussian distribution $$q(z \mid x, \Lambda) = \mathcal{N}(z \mid \mu(x; \Lambda), \Sigma(x; \Lambda))$$ where $$\mu$$ and $$\Sigma$$ are outputs of a neural network taking $$x$$ as input, and $$\Lambda$$ denotes the set of the neural network’s weights, that is, the parameters we optimize the ELBO with respect to (the same applies to $$\Theta$$). To make the reparametrization trick practical, one would like to be able to compute $$\Sigma^{1/2}$$ quickly. Computing a matrix square root of $$\Sigma$$ directly would be too expensive. Instead you might want to have the neural network predict $$\Sigma^{1/2}$$ in the first place, or consider only diagonal covariance matrices (as is done in the paper).
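With a diagonal covariance the square root is taken elementwise, so sampling is cheap. The sketch below assumes the inference network outputs a mean vector and a log-variance vector (a common convention, but an assumption here); the fixed numbers stand in for those outputs, and `sample_q` is a hypothetical helper name.

```python
import math
import random

def sample_q(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + Sigma^{1/2} * eps."""
    sigma = [math.exp(0.5 * lv) for lv in log_var]  # elementwise square root
    eps = [rng.gauss(0.0, 1.0) for _ in mu]         # eps ~ N(0, I), free of Lambda
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
# Stand-ins for the network outputs mu(x; Lambda) and log diag Sigma(x; Lambda).
mu = [1.0, -2.0]
log_var = [math.log(0.25), math.log(9.0)]           # standard deviations 0.5 and 3.0
samples = [sample_q(mu, log_var, rng) for _ in range(50_000)]

for d in range(len(mu)):
    column = [z[d] for z in samples]
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    print(f"dim {d}: empirical mean {mean:.3f}, empirical std {std:.3f}")
```

The empirical moments should be close to the requested ones, which is a quick sanity check that the transformation samples from the intended Gaussian.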

With a Gaussian inference network $$q(z \mid x, \Lambda)$$ and a Gaussian prior $$p(z \mid \Theta)$$ we can compute the KL divergence $$D_{KL}(q(z \mid x, \Lambda) \mid\mid p(z \mid \Theta))$$ analytically (see the formula at stats.stackexchange). This slightly reduces the variance of the gradient estimator, though one can still train a VAE while estimating the KL divergence by Monte Carlo, just like the reconstruction error.
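As an illustration, here is a sketch for the special case of a diagonal Gaussian $$q$$ and a standard normal prior $$\mathcal{N}(0, I)$$ (an assumption; the text above allows a more general Gaussian prior). It compares the closed-form value with a Monte Carlo estimate; both function names are made up for this example.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(s ** 2 + m ** 2 - 1.0 - math.log(s ** 2)
                     for m, s in zip(mu, sigma))

def kl_monte_carlo(mu, sigma, rng, n=50_000):
    """Estimate the same KL as the average of log q(z) - log p(z) over z ~ q."""
    total = 0.0
    for _ in range(n):
        for m, s in zip(mu, sigma):
            eps = rng.gauss(0.0, 1.0)
            z = m + s * eps
            log_q = -0.5 * eps ** 2 - math.log(s)  # -0.5*log(2*pi) cancels with log_p
            log_p = -0.5 * z ** 2
            total += log_q - log_p
    return total / n

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [0.8, 1.5]
print("closed form:", kl_diag_gaussian_to_standard_normal(mu, sigma))
print("Monte Carlo:", kl_monte_carlo(mu, sigma, rng))
```

The two numbers should agree up to Monte Carlo noise; the closed form simply removes that noise from the ELBO estimate.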

We optimize both generative and inference networks by gradient ascent. This joint optimization pushes both the approximation towards the model, and the model towards the approximation. As a result, the generative network is encouraged to learn latent representations $$z$$ that exhibit the same independence pattern as the inference network. For example, if the inference network is Gaussian and has diagonal covariance matrices, then the generative model will try to learn representations with independent components.

VAEs have become popular because one can use them as generative models. Essentially, a VAE is an easy-to-train autoencoder with a natural sampling procedure (see footnote 1): suppose you’ve trained the model, and now want to generate new samples similar to those in the training set. To do so you first sample $$z$$ from the prior $$p(z)$$, and then generate $$x$$ using the model $$p(x \mid z, \Theta)$$. Both operations are easy: the first one is sampling from some standard distribution (a Gaussian, for example), and the second one is just one feed-forward pass followed by sampling from another standard distribution (Bernoulli, for example, in case $$x$$ is a binary image).
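The two sampling steps can be sketched as follows. The one-layer `decoder` and its random weights are hypothetical stand-ins for a trained generative network, used only to show the control flow.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def decoder(z, weights, biases):
    """Hypothetical one-layer generative network: a Bernoulli mean per pixel."""
    return [sigmoid(sum(w * zi for w, zi in zip(row, z)) + b)
            for row, b in zip(weights, biases)]

def sample_x(weights, biases, latent_dim, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # z ~ p(z) = N(0, I)
    probs = decoder(z, weights, biases)                    # one feed-forward pass
    return [1 if rng.random() < p else 0 for p in probs]   # x ~ Bernoulli(probs)

rng = random.Random(0)
latent_dim, num_pixels = 2, 6
# Stand-in for trained parameters Theta; random here purely for illustration.
weights = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
           for _ in range(num_pixels)]
biases = [0.0] * num_pixels

x = sample_x(weights, biases, latent_dim, rng)
print(x)  # a binary "image" with num_pixels entries
```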

If you want to read more on Variational Auto-Encoders, I refer you to a great tutorial by Carl Doersch. Also take a look at Dustin Tran’s post Variational auto-encoders do not train complex generative models (and see the reddit discussion also!).

### Helmholtz Machines

Finally, I’d like to add a historical perspective. The idea of two networks, one “encoding” an observation $$x$$ to some latent representation (code) $$z$$, and another “decoding” it back, is definitely not new. In fact, the whole idea is a special case of the Helmholtz machines introduced by Geoffrey Hinton and colleagues 20 years ago.

A Helmholtz machine can be thought of as a neural network with stochastic hidden layers. Namely, we now have $$M$$ stochastic hidden layers (latent variables) $$h_1, \dots, h_M$$ (with deterministic $$h_0 = x$$) where the layer $$h_{m-1}$$ is stochastically produced by the layer $$h_{m}$$, that is, it is sampled from some distribution $$p(h_{m-1} \mid h_m)$$, which, as you might have guessed already, is parametrized in the same way as in usual VAEs. Actually, a VAE is a special case of a Helmholtz machine with just one stochastic layer (but each stochastic layer contains a neural network of arbitrarily many deterministic layers inside of it).

Figure: an instance of a Helmholtz machine with 2 stochastic layers (blue cloudy nodes), each stochastic layer having 2 deterministic hidden layers (white rectangles).

The joint model distribution is

$p(x, h_1, \dots, h_M \mid \Theta) = p(h_M \mid \Theta) \prod_{m=0}^{M-1} p(h_m \mid h_{m+1}, \Theta)$

The approximate posterior has the same form, but in the opposite order:

$q(h_1, \dots, h_M \mid x, \Lambda) = \prod_{m=1}^{M} q(h_{m} \mid h_{m-1}, \Lambda)$

The $$p(x, h_1, \dots, h_{M-1} \mid h_M)$$ distribution is usually called a generative network (or model) as it allows one to generate samples from latent representation(s). The approximate posterior $$q(h_1, \dots, h_M \mid x, \Lambda)$$ in this framework is called a recognition network (or model). Presumably, the name reflects the purpose of the network to recognize the hidden structure of observations.
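To make the two factorizations concrete, here is a minimal sketch of a single-sample ELBO estimate for a toy chain of $$M$$ scalar layers. Everything specific is an assumption made for illustration: each conditional is a linear Gaussian, with generative weight `a`, recognition weight `b` and recognition standard deviation `s`. This toy model was picked because its marginal $$\log p(x)$$ is available in closed form, so we can check that the ELBO stays below it.

```python
import math
import random

def log_normal(v, mean, std):
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((v - mean) / std) ** 2)

def elbo_sample(x, M, a, b, s, rng):
    """Single-sample ELBO estimate for a chain of M scalar Gaussian layers."""
    h = [x]                                   # h[0] = x is observed
    log_q = 0.0
    for m in range(1, M + 1):                 # recognition pass: q(h_m | h_{m-1})
        h_m = rng.gauss(b * h[m - 1], s)
        log_q += log_normal(h_m, b * h[m - 1], s)
        h.append(h_m)
    log_p = log_normal(h[M], 0.0, 1.0)        # top-level prior p(h_M)
    for m in range(M):                        # generative factors p(h_m | h_{m+1})
        log_p += log_normal(h[m], a * h[m + 1], 1.0)
    return log_p - log_q

rng = random.Random(0)
x, M, a, b, s = 2.0, 2, 0.5, 0.2, 1.0
estimates = [elbo_sample(x, M, a, b, s, rng) for _ in range(20_000)]
mean_elbo = sum(estimates) / len(estimates)

# In this linear-Gaussian toy, x ~ N(0, sum_k a^(2k)) for k = 0..M.
marginal_var = sum(a ** (2 * k) for k in range(M + 1))
log_px = log_normal(x, 0.0, math.sqrt(marginal_var))
print("ELBO estimate:", mean_elbo, "log p(x):", log_px)
```

The averaged estimate should land below $$\log p(x)$$, with the gap equal to the KL divergence between this (deliberately suboptimal) recognition model and the true posterior.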

So, if the VAE is a special case of Helmholtz machines, what’s new then? The standard algorithm for learning Helmholtz machines, the Wake-Sleep algorithm, turns out to optimize a different objective. Thus, one of the significant contributions of Kingma and Welling is the application of the reparametrization trick to make optimization of the ELBO w.r.t. $$\Lambda$$ tractable.
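To spell out the difference, here is a sketch based on the usual description of Wake-Sleep rather than on anything derived in this post; $$h$$ is shorthand for $$(h_1, \dots, h_M)$$. For a fixed $$\Theta$$, maximizing the ELBO w.r.t. $$\Lambda$$ minimizes the KL divergence from the approximation to the true posterior, whereas the sleep phase fits the recognition network to samples "dreamed" by the generative model, which corresponds to the KL divergence in the opposite direction:

\begin{align} \text{ELBO:} \quad & \min_\Lambda D_{KL}(q(h \mid x, \Lambda) \mid\mid p(h \mid x, \Theta)) \\ \text{sleep phase:} \quad & \max_\Lambda \mathbb{E}_{p(x, h \mid \Theta)} \log q(h \mid x, \Lambda) \; \Longleftrightarrow \; \min_\Lambda \mathbb{E}_{p(x \mid \Theta)} D_{KL}(p(h \mid x, \Theta) \mid\mid q(h \mid x, \Lambda)) \end{align}

The wake phase, which updates $$\Theta$$ using samples from $$q$$, does follow a gradient estimate of the ELBO; it is the update for $$\Lambda$$ that differs.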

1. This is not true for other popular autoencoding architectures. Boltzmann Machines are too hard to train properly, while traditional autoencoders (contractive, denoising) are hard to sample from (special procedures involving MCMC are required).