Neural Samplers and Hierarchical Variational Inference

This post sets the background for an upcoming post on my work on more efficient use of neural samplers for Variational Inference.

Variational Inference

At the core of Bayesian Inference lies the well-known Bayes' theorem, relating our prior beliefs p(z) with those obtained after observing some data x:

$$ p(z|x) = \frac{p(x|z)\, p(z)}{p(x)} = \frac{p(x|z)\, p(z)}{\int p(x, z)\, dz} $$

However, in most practical cases the denominator p(x) requires an intractable integration. Thus the field of Approximate Bayesian Inference seeks to approximate this posterior efficiently. For example, MCMC-based methods essentially use a sample-based empirical distribution as the approximation.

In problems of learning latent variable models (for example, VAEs) we seek to do maximum likelihood learning for some hierarchical model pθ(x) = ∫ pθ(x, z) dz, but computing the integral is intractable and the latent variables z are not observed.

Variational Inference is a method that has gained a lot of popularity recently, especially due to its scalability. It nicely allows for simultaneous inference (finding an approximation to the posterior) and learning (optimizing the parameters of the model) by means of the evidence lower bound (ELBO) on the marginal log-likelihood log pθ(x), obtained by applying importance sampling followed by Jensen's inequality:

$$ \log p_\theta(x) = \log \mathbb{E}_{p_\theta(z)} p_\theta(x|z) = \log \mathbb{E}_{q_\phi(z|x)} \frac{p_\theta(x, z)}{q_\phi(z|x)} \ge \mathbb{E}_{q_\phi(z|x)} \log \frac{p_\theta(x, z)}{q_\phi(z|x)} =: \mathcal{L} $$
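To make this concrete, here's a minimal sketch of a single-sample Monte Carlo estimate of this bound for a toy Gaussian model (untrained linear networks; all names and sizes below are illustrative assumptions, not anything from a particular paper):

```python
import torch
import torch.nn as nn
import torch.distributions as D

# Toy Gaussian encoder/decoder; all sizes and names here are illustrative.
x_dim, z_dim = 4, 2
enc = nn.Linear(x_dim, 2 * z_dim)   # parameters of q_phi(z|x)
dec = nn.Linear(z_dim, 2 * x_dim)   # parameters of p_theta(x|z)
prior = D.Normal(torch.zeros(z_dim), torch.ones(z_dim))

def elbo(x):
    mu, log_sigma = enc(x).chunk(2, dim=-1)
    q = D.Normal(mu, log_sigma.exp())                                 # q_phi(z|x)
    z = q.rsample()                                                   # reparameterized sample
    mu_x, log_sigma_x = dec(z).chunk(2, dim=-1)
    log_lik = D.Normal(mu_x, log_sigma_x.exp()).log_prob(x).sum(-1)   # log p_theta(x|z)
    log_prior = prior.log_prob(z).sum(-1)                             # log p_theta(z)
    log_q = q.log_prob(z).sum(-1)                                     # log q_phi(z|x)
    # one-sample estimate of E_q[ log p_theta(x, z) - log q_phi(z|x) ]
    return (log_lik + log_prior - log_q).mean()

print(elbo(torch.randn(8, x_dim)))
```

In practice one would maximize this estimate w.r.t. both the encoder and decoder parameters with stochastic gradients, relying on the reparameterized sample z.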

This lower bound should be maximized w.r.t. both ϕ (variational parameters) and θ (model parameters). To better understand the effect of such optimization, it's helpful to consider the gap between the marginal log-likelihood and the bound. It's easy to show that this gap is equal to some Kullback-Leibler (KL) divergence:

$$ \log p_\theta(x) - \mathbb{E}_{q_\phi(z|x)} \log \frac{p_\theta(x, z)}{q_\phi(z|x)} = D_{KL}(q_\phi(z|x) \mid\mid p_\theta(z|x)) $$

Now it's easy to see that maximizing the ELBO w.r.t. ϕ tightens the bound and performs approximate inference -- qϕ(z|x) becomes closer to the true posterior pθ(z|x) as measured by the KL divergence. While we hope that maximizing the bound w.r.t. θ increases the marginal log-likelihood log pθ(x), this is obstructed by the KL divergence. In effect, maximizing the ELBO is equivalent to maximizing the marginal log-likelihood regularized with DKL(qϕ(z|x) ∣∣ pθ(z|x)), except there's no hyperparameter to control the strength of this regularization. This regularization prevents the true posterior pθ(z|x) from deviating too much from the variational distribution qϕ(z|x), which is not necessarily bad -- you then know that the true posterior has a reasonably simple form -- but on the other hand it prevents us from learning powerful and expressive models pθ(x) = ∫ pθ(x|z) pθ(z) dz. Therefore, if we're after expressive models pθ(x), we should probably minimize this regularization effect, for example, by means of more expressive variational approximations.

Intuitively, the tighter the bound, the smaller this regularization effect is. And it's relatively easy to obtain a tighter bound:

$$ \log p_\theta(x) = \log \mathbb{E}_{p_\theta(z)} p_\theta(x|z) = \log \mathbb{E}_{q_\phi(z_{1:K}|x)} \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \ge \mathbb{E}_{q_\phi(z_{1:K}|x)} \log \left( \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right) =: \mathcal{L}_K $$

That is, by simply taking several samples to estimate the marginal likelihood pθ(x) under the logarithm, we made the bound tighter. Such bounds are usually called IWAE bounds (after the Importance Weighted Autoencoders paper in which they were first introduced), but we'll be calling them Multisample Variational Lower Bounds. Such bounds were shown to correspond to using more expressive proposal distributions and are very powerful, but they require multiple evaluations of the decoder pθ(x|z), which might be very expensive for complex models, for example, when applying VAEs to dialogue modelling.
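As a quick illustration, here's a sketch of a Monte Carlo estimate of this K-sample bound for a toy one-dimensional model with analytic densities; the particular densities below are illustrative assumptions chosen only so the snippet runs standalone:

```python
import math
import torch
import torch.distributions as D

# K-sample (IWAE-style) bound for a toy 1D model with analytic densities:
# p_theta(z) = N(0, 1), p_theta(x|z) = N(z, 1), proposal q_phi(z|x) = N(x/2, 1).
# These particular densities are illustrative choices, not from the post.
def multisample_bound(x, K):
    q = D.Normal(x / 2, torch.ones_like(x))     # q_phi(z|x)
    z = q.rsample((K,))                         # shape (K, batch)
    log_w = (D.Normal(0.0, 1.0).log_prob(z)     # log p_theta(z)
             + D.Normal(z, 1.0).log_prob(x)     # log p_theta(x|z)
             - q.log_prob(z))                   # - log q_phi(z|x)
    # log (1/K) sum_k exp(log_w_k), averaged over the batch
    return (torch.logsumexp(log_w, dim=0) - math.log(K)).mean()

x = torch.randn(8)
print(multisample_bound(x, 1), multisample_bound(x, 64))
```

The K = 1 case recovers the ordinary single-sample ELBO estimate, and increasing K tightens the bound in expectation, at the price of K evaluations of log pθ(x|z).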

An alternative direction is to use a more expressive family of variational distributions qϕ(z|x). Moreover, with the explosion of Deep Learning we actually know one family of models that has empirically demonstrated terrific approximation capabilities -- Neural Networks. We will therefore consider so-called Neural Samplers as generators of approximate posterior q(z|x) samples. A Neural Sampler is simply a neural network that is trained to take some simple (say, Gaussian) random variable ψ ∼ q(ψ|x) and transform it into a z that has the properties we seek. Canonical examples are GANs and VAEs, and we'll get back to them later in the discussion.
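A minimal sketch of such a sampler, assuming Gaussian noise ψ and an arbitrary feed-forward network (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

# A neural sampler: push simple noise psi through a network, conditioned on x,
# to produce an approximate posterior sample z. Sizes and names are illustrative.
class NeuralSampler(nn.Module):
    def __init__(self, x_dim=4, psi_dim=8, z_dim=2, hidden=32):
        super().__init__()
        self.psi_dim = psi_dim
        self.net = nn.Sequential(
            nn.Linear(x_dim + psi_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim))

    def forward(self, x):
        psi = torch.randn(x.shape[0], self.psi_dim)      # psi ~ q(psi|x), here just N(0, I)
        return self.net(torch.cat([x, psi], dim=-1))     # z = f_phi(psi, x)

sampler = NeuralSampler()
z = sampler(torch.randn(16, 4))   # 16 approximate posterior samples
```

The catch, as discussed next, is that the density q(z|x) of the produced samples is only defined implicitly.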

Using neural nets this way is not a new idea. There's been a lot of research along these lines, which we might roughly classify into three directions based on how they deal with the intractable log qϕ(z|x) term:

  • Flows
  • Estimates
  • Bounds

I'll briefly cover the first two and then discuss the last one, which is of central relevance to this post.

Flows

So-called Flow models appeared on the radar with the publication of the Normalizing Flows paper, and then quickly exploded into a hot topic of research. At the moment there exist dozens of works on all kinds of flows. The basic idea is that if the neural net defining the sampler is invertible, then by computing the determinant of its Jacobian matrix we can analytically find the density q(z|x) via the change-of-variables formula. Flows further restrict the samplers to have efficiently computable Jacobian determinants. For further reading, refer to Adam Kosiorek's post.
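For intuition, here's a sketch of the change-of-variables computation for the simplest possible invertible map, an elementwise affine transform (conditioning on x is omitted for brevity; names and sizes are illustrative). Real flows stack many richer invertible layers, but the bookkeeping is the same:

```python
import torch
import torch.nn as nn
import torch.distributions as D

# Change of variables for the simplest invertible map: an elementwise affine flow
# z = psi * exp(s) + t. Its Jacobian is diag(exp(s)), so
# log q(z) = log q0(psi) - sum(s). Sizes and names are illustrative.
z_dim = 2
s = nn.Parameter(torch.zeros(z_dim))                      # log-scales
t = nn.Parameter(torch.zeros(z_dim))                      # shifts
base = D.Normal(torch.zeros(z_dim), torch.ones(z_dim))    # base density q0(psi)

def sample_with_log_prob(n):
    psi = base.sample((n,))
    z = psi * s.exp() + t
    log_q = base.log_prob(psi).sum(-1) - s.sum()   # subtract log|det Jacobian|
    return z, log_q

z, log_q = sample_with_log_prob(5)
print(z.shape, log_q.shape)
```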

Flows have been shown to be very powerful: they even manage to model high-dimensional data directly, as was shown by OpenAI researchers with the Glow model. However, Flow-based models require a neural network specially designed to be invertible and to have an easy-to-compute Jacobian determinant. Such restrictions might lead to inefficient parameter usage, requiring many more parameters and much more compute compared to simpler methods. The aforementioned Glow uses a lot of parameters and compute to learn modestly high-resolution images.

Estimates

Another direction is to estimate the density ratio qϕ(z|x)/p(z) by means of auxiliary models. For example, the Density Ratio Trick lying at the heart of many GANs says that if you have an optimal discriminator D(z,x) discerning samples from q(z|x) from samples from p(z) (for the given x), then the following holds:

$$ \frac{D(z, x)}{1 - D(z, x)} = \frac{q(z|x)}{p(z)} $$

In practice we do not have the optimal classifier, so instead we train an auxiliary model to perform this classification. A particularly successful approach along this direction is Adversarial Variational Bayes. The biggest advantage of this method is the lack of any restrictions on the Neural Sampler (except the standard requirement of differentiability). The disadvantage is that it loses all bound guarantees and inherits a lot of stability issues from GANs.
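As a sketch of how the ratio is read off in practice: since D is the sigmoid of the discriminator's output, log D/(1-D) is exactly the pre-sigmoid logit. The discriminator below is an untrained stand-in, so the numbers it produces are meaningless; it only illustrates the plumbing:

```python
import torch
import torch.nn as nn

# Density ratio trick: a classifier D(z, x) trained to tell z ~ q(z|x) (label 1)
# apart from z ~ p(z) (label 0) satisfies D/(1 - D) = q(z|x)/p(z) at optimality,
# so the log-ratio is simply the pre-sigmoid logit. The discriminator below is an
# untrained, purely illustrative stand-in.
z_dim, x_dim = 2, 4
disc = nn.Sequential(nn.Linear(z_dim + x_dim, 32), nn.ReLU(), nn.Linear(32, 1))

def log_density_ratio(z, x):
    # log D(z, x) / (1 - D(z, x)) == the raw logit of the discriminator
    return disc(torch.cat([z, x], dim=-1)).squeeze(-1)

print(log_density_ratio(torch.randn(8, z_dim), torch.randn(8, x_dim)))
```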

Bounds and Hierarchical Variational Inference

Arguably, the most natural approach to employing Neural Samplers as variational approximations is to give an efficient lower bound on the ELBO. In particular, we'd like to give a variational lower bound on the intractable term log 1/qϕ(z|x).

You can notice that for the Neural Sampler as described above the marginal density qϕ(z|x) has the form qϕ(z|x) = ∫ qϕ(z|x,ψ) qϕ(ψ|x) dψ, very similar to that of the VAE itself! Indeed, the Neural Sampler is a latent variable model just like the VAE, except it's conditioned on x. Great -- you might think -- we'll just reuse the bounds we have derived above, problem solved, right? Well, no. The problem is that we need to give a lower bound on the negative marginal log-density, or equivalently, an upper bound on the marginal log-density.

But first we need to figure out one important question: what is qϕ(z|x,ψ)? In the case of a GAN-like procedure we could say that this density is degenerate: qϕ(z|ψ,x) = δ(z - fϕ(ψ,x)), where fϕ is the neural network that generates z from ψ. While the estimation-based approach is fine with this since it doesn't work with densities directly, for the bounds we need qϕ(z|x,ψ) to be a well-defined density, so from now on we'll assume it's some proper density, not the delta function².
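One simple way to get a proper density, sketched below, is to replace the delta function with a narrow Gaussian centred at the network's output fϕ(ψ,x); the particular noise scale and all names are assumptions made only for illustration:

```python
import torch
import torch.nn as nn
import torch.distributions as D

# A hierarchical proposal with a proper conditional density: instead of the
# degenerate z = f_phi(psi, x), let q_phi(z|psi, x) be a narrow Gaussian centred
# at f_phi(psi, x). The noise scale, sizes and names are illustrative.
x_dim, psi_dim, z_dim = 4, 8, 2
f = nn.Sequential(nn.Linear(x_dim + psi_dim, 32), nn.ReLU(), nn.Linear(32, z_dim))

def sample_z(x, sigma=0.1):
    psi = torch.randn(x.shape[0], psi_dim)                   # psi ~ q(psi|x), here just N(0, I)
    cond = D.Normal(f(torch.cat([x, psi], dim=-1)), sigma)   # q_phi(z|psi, x)
    z = cond.rsample()
    return z, psi, cond.log_prob(z).sum(-1)                  # log q_phi(z|psi, x) is now well defined

z, psi, log_q_cond = sample_z(torch.randn(16, x_dim))
```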

Luckily, one can use the following identity

$$ \mathbb{E}_{q_\phi(\psi|z,x)} \frac{\tau_\eta(\psi|z,x)}{q_\phi(z, \psi|x)} = \frac{1}{q_\phi(z|x)} $$

where τη(ψ|z,x) is an arbitrary density we'll be calling the auxiliary variational distribution. The identity holds because qϕ(ψ|z,x) = qϕ(z,ψ|x) / qϕ(z|x), so the expectation reduces to ∫ τη(ψ|z,x) dψ / qϕ(z|x) = 1/qϕ(z|x). Then, by applying the logarithm and Jensen's inequality, we obtain the much-needed variational upper bound:

$$ \log q_\phi(z|x) \le \mathbb{E}_{q_\phi(\psi|z,x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z,x)} =: \mathcal{U} $$

Except -- oops -- it needs a sample from the true inverse model qϕ(ψ|z,x), which in general is not any easier to obtain than calculating log qϕ(z|x) in the first place. Bummer? No -- it turns out we can use the fact that the samples z are coming from the same hierarchical process qϕ(z,ψ|x)! Indeed, since we're interested in log qϕ(z|x) averaged over all z|x:

$$ \mathbb{E}_{q_\phi(z|x)} \log q_\phi(z|x) \le \mathbb{E}_{q_\phi(z|x)} \mathbb{E}_{q_\phi(\psi|z,x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z,x)} = \mathbb{E}_{q_\phi(z,\psi|x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z,x)} = \mathbb{E}_{q_\phi(\psi|x)} \mathbb{E}_{q_\phi(z|\psi,x)} \log \frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z,x)} $$

These algebraic manipulations show that if we sampled z through a hierarchical scheme, then the ψ used to generate this z can be thought of as a free posterior sample¹. This leads to the following lower bound on the ELBO, introduced in the Hierarchical Variational Models paper:

$$ \log p_\theta(x) \ge \mathcal{L} \ge \mathbb{E}_{q_\phi(z,\psi|x)} \log \frac{p_\theta(x, z)}{\frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z,x)}} $$

Interestingly, this bound admits another interpretation. Indeed, it can be equivalently represented as

$$ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z,\psi|x)} \log \frac{p_\theta(x, z) \, \tau_\eta(\psi|z,x)}{q_\phi(z, \psi|x)} $$

which is just the ELBO for an extended model in which the latent code z is extended with ψ, and since there was no ψ in the original model pθ(x,z), we extend the model with τη(ψ|z,x) as well. This view has been investigated in the Auxiliary Deep Generative Models paper.
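Here's a sketch of a one-sample estimate of this hierarchical bound, with untrained Gaussian stand-ins for qϕ(ψ|x), qϕ(z|ψ,x), pθ(x|z) and τη(ψ|z,x); all architectures, sizes and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.distributions as D

# One-sample estimate of the hierarchical bound
#   E_{q(z,psi|x)} [ log p_theta(x, z) + log tau_eta(psi|z, x) - log q_phi(z, psi|x) ].
# All networks are untrained stand-ins; names and sizes are illustrative.
x_dim, psi_dim, z_dim = 4, 8, 2
f = nn.Linear(x_dim + psi_dim, 2 * z_dim)      # parameters of q_phi(z|psi, x)
dec = nn.Linear(z_dim, 2 * x_dim)              # parameters of p_theta(x|z)
aux = nn.Linear(x_dim + z_dim, 2 * psi_dim)    # parameters of tau_eta(psi|z, x)
prior_z = D.Normal(torch.zeros(z_dim), torch.ones(z_dim))
prior_psi = D.Normal(torch.zeros(psi_dim), torch.ones(psi_dim))

def gaussian(params):
    mu, log_sigma = params.chunk(2, dim=-1)
    return D.Normal(mu, log_sigma.exp())

def hierarchical_elbo(x):
    psi = prior_psi.rsample((x.shape[0],))                  # psi ~ q(psi|x), here N(0, I)
    q_z = gaussian(f(torch.cat([x, psi], dim=-1)))          # q_phi(z|psi, x)
    z = q_z.rsample()
    log_joint = gaussian(dec(z)).log_prob(x).sum(-1) + prior_z.log_prob(z).sum(-1)
    log_tau = gaussian(aux(torch.cat([x, z], dim=-1))).log_prob(psi).sum(-1)
    log_q = q_z.log_prob(z).sum(-1) + prior_psi.log_prob(psi).sum(-1)   # log q_phi(z, psi|x)
    return (log_joint + log_tau - log_q).mean()

print(hierarchical_elbo(torch.randn(8, x_dim)))
```

Note how the ψ drawn to produce z is reused as the "free" posterior sample when evaluating τη(ψ|z,x).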

Let's now return to the variational upper bound U. Can we give a multisample variational upper bound on log qϕ(z|x), similar to IWAE? Well, following the same logic, we can arrive at the following:

$$ \log \frac{1}{q_\phi(z|x)} = \log \mathbb{E}_{q_\phi(\psi_{1:K}|z,x)} \frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k|z,x)}{q_\phi(z, \psi_k|x)} \ge \mathbb{E}_{q_\phi(\psi_{1:K}|z,x)} \log \frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k|z,x)}{q_\phi(z, \psi_k|x)} $$

$$ \log q_\phi(z|x) \le \mathbb{E}_{q_\phi(\psi_{1:K}|z,x)} \log \frac{1}{\frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k|z,x)}{q_\phi(z, \psi_k|x)}} $$

However, this bound -- the Variational Harmonic Mean Estimator -- is no good, as it uses K samples from the true inverse model qϕ(ψ|x,z), whereas we can have only one free sample. The rest have to be obtained through expensive MCMC sampling, and that doesn't scale well. Interestingly, this estimator was already presented in the original VAE paper (though buried in Appendix D), but was discarded as too unstable.
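Just to show the arithmetic of this multisample bound, here's a sketch of the K-sample estimate given hypothetical log-density values; in practice the required ψ samples from qϕ(ψ|z,x) are exactly what we cannot cheaply obtain, which is the whole problem:

```python
import math
import torch

# The K-sample harmonic-mean-style upper bound on log q_phi(z|x), computed from
# log tau_eta(psi_k|z, x) and log q_phi(z, psi_k|x) for psi_1..psi_K drawn from the
# true inverse model q_phi(psi|z, x). The random inputs below are placeholders:
# obtaining such samples is exactly the hard part.
def multisample_upper_bound(log_tau, log_q_joint):
    # inputs: tensors of shape (K,) for a single pair (z, x)
    K = log_tau.shape[0]
    # -log( (1/K) * sum_k tau_k / q_joint_k )
    return -(torch.logsumexp(log_tau - log_q_joint, dim=0) - math.log(K))

print(multisample_upper_bound(torch.randn(16), torch.randn(16) - 3.0))
```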

Why a multisample variational upper bound?

The gap between the ELBO and its tractable lower bound can be shown to be

$$ \mathcal{L} - \mathbb{E}_{q_\phi(z,\psi|x)} \log \frac{p_\theta(x, z)}{\frac{q_\phi(z, \psi|x)}{\tau_\eta(\psi|z,x)}} = \mathbb{E}_{q_\phi(z|x)} D_{KL}(q_\phi(\psi|x,z) \mid\mid \tau_\eta(\psi|x,z)) $$

So, since we'll be using some simple τη(ψ|x,z), we'll be restricting the true inverse model qϕ(ψ|x,z) to also be somewhat simple, limiting the expressivity of q(z|x), thus limiting the expressivity of pθ(z|x)... Looks like we ended up where we started, right? Well, not quite, as we might have gained more than we lost by moving the simple distribution from q(z|x) to τ(ψ|x,z), but it's still not quite satisfying. A multisample upper bound would allow us to give tighter bounds (which don't suffer from the regularization as much) without invoking any additional evaluations of the model's decoder pθ(x|z) (see the Variational Harmonic Mean Estimator above as an example).

So... Are there efficient multisample variational upper bounds? A year ago you might have thought the answer was "Probably not", until... [To be continued]


  1. This is not a new result; see Grosse et al., section 4.2, the paragraph on "simulated data". 

  2. The problem is that the delta function is not an ordinary function but a generalized function, and special care has to be taken when dealing with it.