This post sets background for the upcoming post on my work on more efficient use of neural samplers for Variational Inference.

## Variational Inference

At the core of *Bayesian Inference* lies the well-known Bayes’ theorem, relating our prior beliefs \(p(z)\) with those obtained after observing some data \(x\):

\[ p(z|x) = \frac{p(x|z) p(z)}{p(x)} = \frac{p(x|z) p(z)}{\int p(x, z) dz} \]

However, in most practical cases the denominator \(p(x)\) requires intractable integration. Thus the field of Approximate Bayesian Inference seeks to efficiently approximate this posterior. For example, MCMC-based methods essentially use sample-based empirical distribution as an approximation.

In problems of *learning* latent variable models (for example, VAEs) we seek to do maximum likelihood learning for some hierarchical model \(p_\theta(x) = \int p_\theta(x, z) dz\), but computing the integral is intractable and latent variables \(z\) are not observed.

Variational Inference is a method that gained a lot of popularity recently, especially due to its scalability. It nicely allows for simultaneous inference (finding the posterior approximate) and learning (optimizing parameters of the model) by means of the *evidence lower bound* (ELBO) on the *marginal log-likelihood* \(\log p_\theta(x)\), obtained by applying importance sampling followed by Jensen’s inequality:

\[ \log p_\theta(x) = \log \mathbb{E}_{p_\theta(z)} p_\theta(x|z) = \log \mathbb{E}_{q_\phi(z|x)} \frac{p_\theta(x, z)}{q_\phi(z|x)} \ge \mathbb{E}_{q_\phi(z|x)} \log \frac{p_\theta(x, z)}{q_\phi(z|x)} =: \mathcal{L} \]

This lower bound should be maximized w.r.t. both \(\phi\) (variational parameters) and \(\theta\) (model parameters). To better understand the effect of such optimization, it’s helpful to consider the gap between the marginal log-likelihood and the bound. It’s easy to show that this gap is equal to some Kullback-Leibler (KL) divergence:

\[ \log p_\theta(x) - \mathbb{E}_{q(z|x)} \log \frac{p_\theta(x,z)}{q_\phi(z|x)} = D_{KL}(q_\phi(z|x) \mid\mid p_\theta(z|x)) \]

Now it’s easy to see that maximizing the ELBO w.r.t. \(\phi\) tightens the bound and performs approximate inference – \(q(z|x)\) becomes closer to the true posterior \(p(z|x)\) as measured by the KL divergence. While we hope that maximizig the bound w.r.t. \(\theta\) increases marginal log-likelihood \(\log p_\theta(x)\), this is obstructed by the KL-divergence. In a more realistic setting maximizing the ELBO is equivalent to maximizing the marginal log-likelihood regularized with the \(D_{KL}(q_\phi(z|x) \mid\mid p_\theta(z|x))\), except there’s no hyperparameter to control the strength of this regularization. This regularization prevents the true posterior \(p_\theta(z|x)\) from deviating too much from the variational distribution \(q(z|x)\), which is not bad, as you’d know that the true posterior has somewhat simple form, but on the other hand it prevents us from learning powerful and expressive models \(p_\theta(x) = \int p_\theta(x|z) p_\theta(z) dz\). Therefore if we’re after expressive models \(p_\theta(x)\), we probably should minimize such regularization effect, for example, by means of more expressive variational approximations.

Intuitively, tighter the bound – lesser the regularizational effect is. And it’s relatively easy to obtain a tighter bound: \[
\begin{align*}
\log p_\theta(x)
&= \log \mathbb{E}_{p_\theta(z)} p_\theta(x|z)
= \log \mathbb{E}_{q_\phi(z_{1:K}|x)} \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \\
&\ge \mathbb{E}_{q_\phi(z_{1:K}|x)} \log\left( \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right)
=: \mathcal{L}_K
\ge \mathcal{L}
\end{align*}
\] That is, by simply taking several samples to estimate the marginal likelihood \(p_\theta(x)\) under the logarithm, we made the bound tighter. Such bounds usually are called IWAE bounds (for Importance Weighted Autoencoders paper they were first introduced in), but we’ll be calling these bounds *Multisample Variational Lower Bounds*. Such bounds were shown to correspond to using more expressive proposal distributions and are very powerful, but require multiple evaluations of the decoder \(p_\theta(x|z)\), which might be very expensive for complex models, for example, when applying VAEs to dialogue modelling.

An alternative direction is to use more expressive family of variational distributions \(q_\phi(z|x)\). Moreover, with the explosion of Deep Learning we actually know one family of models that have empirically demonstrated terrific approximation capabilities – Neural Networks. We therefore will consider so called Neural Samplers as generators of approximate posterior \(q(z|x)\) samples. A *Neural Sampler* is simply a neural network that is trained to take some simple (say, Gaussian) random variable \(\psi \sim q(\psi|x)\) and transform it into \(z\) that has the properties we seek. Canonical examples are GANs and VAEs and we’ll get back to them later in the discussion.

And using neural nets is not a new idea. There’s been a lot of research along this direction, which we might roughly classify into 3 directions based on how they deal with the intractable \(\log q_\phi(z|x)\) term:

- Flows
- Estimates
- Bounds

I’ll briefly cover the first two and then discuss the last one, which is of central relevance to this post.

### Flows

So called Flow models appeared on the radar with the publication of the Normalizing Flows paper, and then quickly exploded into a hot topic of research. At the moment there exist dozens of works on all kinds of flows. The basic idea is that if the Neural Net defining the sampler is invertible, then by computing its Jacobian (the determinant of the Jacobi matrix) we can analytically find the density \(q(z|x)\). Flows further restrict the samplers to have efficiently computable Jacobians. For further reading refer to Adam Kosiorek’s post.

Flows were shown to be very powerful, they even managed to model the high-dimensional data directly, as was shown by OpenAI researchers with Glow model. However, Flow-based model require a neural network specially designed to be invertible and have easy-to-compute Jacobian. Such restrictions might lead to inefficiency in parameter usage, requiring much more parameters and compute compared to simpler methods. The aforementioned Glow uses a lot of parameters and compute to learn modestly hi-res images.

### Estimates

Another direction is to estimate \(q_\phi(z|x)/p(z)\) by means of auxiliary models. For example, the Density Ratio Trick lying at the heart of many GANs say that if you have an optimal discriminator \(D^*(z, x)\) discerning samples from \(q(z|x)\) from those from \(p(z)\) (for the given \(x\)), then the following is true:

\[ \frac{D^*(z, x)}{1 - D^*(z, x)} = \frac{q(z|x)}{p(z)} \]

In practice we do not have the optimal classifier, so instead we train auxiliary model to perform such classification. A particularly successful approach along this direction is the Adversarial Variational Bayes. Biggest advantage of this method is the lack of any restrictions on the Neural Sampler (except the standard requirement of differentiability). The disadvantage is that it loses all bound guarantees and inherits a lot of stability issues from GANs.

## Bounds and Hierarchical Variational Inference

Arguably, the most natural approach to employing Neural Samplers as variational approximations is to give an efficient lower bound on the ELBO. In particular, we’d like to give a variational lower bound on the intractable term \(\log \tfrac{1}{q_\phi(z|x)}\).

You can notice that for the Neural Sampler as described above the marginal density \(q_\phi(z|x)\) has the form of \(q_\phi(z|x) = \int q_\phi(z|x, \psi) q_\phi(\psi|x) d\psi\), very similr to that of VAE itself! Indeed, the Neural Sampler is a latent variable model like the VAE itself, except its conditioned on \(x\). Great – you might think – we’ll just reuse the bounds we have derived above, problem solved, right? Well, no. The problem is that we need to give a lower bound on **negative** marginal log-density, or equivalently, an upper bound on the marginal log-density.

But first we need to figure out one important question: what is \(q_\phi(z|x, \psi)\)? In case of the GAN-like procedure we could say that this density is degenerate: \(q_\phi(z|\psi, x) = \delta(z - f_\phi(\psi, x))\) where \(f_\phi\) is the neural network that generates \(z\) from \(\psi\). While the estimation-based approach is fine with this since it doesn’t work with densities directly, for the bounds, however, we need \(q_\phi(z|x, \psi)\) to be a well-defined density, so from now on we’ll assume it’s some proper density, not the delta function^{1}.

Luckily, one can use the following identity

\[ \mathbb{E}_{q_\phi(\psi|z, x)} \frac{\tau_\eta(\psi|z, x)}{q_\phi(z, \psi|x)} = \frac{1}{q_\phi(z|x)} \]

Where \(\tau(\psi|z, x)\) is arbitrary density we’ll be calling *auxiliary variational distribution*. Then, by applying logarithm and the Jensen’s inequality, we obtain a much needed variational upper bound:

\[ \log q_\phi(z|x) \le \mathbb{E}_{q_\phi(\psi|z, x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z, x)} := \mathcal{U} \]

Except – oops – it needs a sample from the true inverse model \(q_\phi(\psi|z, x)\), which in general is not any easier to obtain than to calculate the \(\log q_\phi(z)\) in the first place. Bummer? No – turns out, we can use the fact that samples \(z\) are coming from the same hierarchical process \(q_\phi(z, \psi|x)\)! Indeed, since we’re interested in the \(\log q_\phi(z)\) averaged over all \(z|x\): \[ \begin{align*} \mathbb{E}_{q_\phi(z|x)} \log q_\phi(z|x) &\le \mathbb{E}_{q_\phi(z|x)} \mathbb{E}_{q_\phi(\psi|z, x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z, x)} \\ & = \mathbb{E}_{q_\phi(z, \psi|x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z, x)} \\ &= \mathbb{E}_{q_\phi(\psi|x)} \mathbb{E}_{q_\phi(z|\psi, x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z, x)} \end{align*} \]

These algebraic manipulations show that if we sampled \(z\) through a hierarchical scheme, then \(\psi\) used to generate this \(z\) can be thought of as a free posterior sample^{2}. This leads to the following lower bound on the ELBO, introduced in Hierarchical Variational Models paper:

\[ \log p_\theta(x) \ge \mathcal{L} \ge \mathbb{E}_{q_\phi(z, \psi|x)} \log \frac{p_\theta(x, z)}{ \tfrac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z, x)} } \] Interestingly, this bound admits another interpretation. Indeed, it can be equivalently represented as \[ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z, \psi|x)} \log \frac{p_\theta(x, z) \tau_\eta(\psi|z, x)}{q_\phi(z, \psi|x) } \] Which is just ELBO for an extended model where the latent code \(z\) as extended with \(\psi\), and since there was not \(\psi\) in the original model \(p_\theta(x, z)\), we extended the model as well with \(\tau_\eta(\psi|z, x)\). This view has been investigated in the Auxiliary Deep Generative Models paper.

Let’s now return to the variational upper bound \(\mathcal{U}\). Can we give a multisample variational upper bound on \(\log q_\phi(z|x)\) similar to IWAE? Well, following the same logic, we can arrive to the following:

\[ \begin{align*} \log \frac{1}{q_\phi(z|x)} & = \log \mathbb{E}_{q_\phi(\psi_{1:K}|z, x)} \frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k|z, x)}{q_\phi(z, \psi_k|x)} \\ &\ge \mathbb{E}_{q_\phi(\psi_{1:K}|z, x)} \log \frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k|z, x)}{q_\phi(z, \psi_k|x)} \end{align*} \] \[ \log q_\phi(z|x) \le \mathbb{E}_{q_\phi(\psi_{1:K}|z, x)} \log \frac{1}{\frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k|z, x)}{q_\phi(z, \psi_k|x)}} \]

However, this bound – Variational Harmonic Mean Estimator – is no good as it uses \(K\) samples from the true inverse model \(q_\phi(\psi|x,z)\) whereas we can have only one free sample. The rest have to be obtained through expensive MCMC sampling and that doesn’t scale well. Interestingly, this estimator was already presented in the original VAE paper (though buried in the Appendix D), but discarded as too unstable.

### Why multisample variational upper bound?

The gap between the ELBO and it’s tractable lower bound can be shown to be \[ \mathcal{L} - \mathbb{E}_{q_\phi(z, \psi|x)} \log \frac{p_\theta(x, z)}{ \tfrac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z, x)} } = D_{KL}(q_\phi(\psi|x,z) \mid\mid \tau_\eta(\psi|x,z)) \] So since we’ll be using some simple \(\tau_eta(\psi|x,z)\), we’ll be restricting the true inverse model \(q_\phi(\psi|x,z)\) to also be somewhat simple, limiting the expressivity of \(q(z|x)\), thus limiting the expressivity \(p(z|x)\)… Looks like we ended up with where we started, right? Well, not quite so, as we might have gained more than lost by moving the simple distribution from \(q(z|x)\) to \(\tau(\psi|x,z)\), but still not quite satisfying. So having a multisample upper bound would allow us to give tighter bounds (which don’t suffer from the regularization that much) and not invoke any additional model’s decoder \(p_\theta(x|z)\) evaluations (see the Variational Harmonic Mean Estimator above as example).

So… Are there efficient multisample variational upper bounds? A year ago you might have thought the answer is “Probably no”, until… [To be continued]

The problem is that delta function is not a real function, but a generalized function, and a special case has to be taken to deal with them.↩

This is not a new result, see Grosse et al., section 4.2, paragraph on “simulated data”.↩