Random notes mostly on Machine Learning

Thoughts on Mutual Information: More Estimators

In this post I'd like to show how Self-Normalized Importance Sampling (IWHVI and IWAE) and Annealed Importance Sampling can be used to give (sometimes sandwich) bounds on the MI in many different cases.

Mutual Information (MI) is an important concept from the Information Theory that captures the idea of information one random variable $X$ carries about the r.v. $Z$ and is usually denoted $I(X, Z)$, however in order to emphasize the underlying joint distribution I'll be using a non-standard notation $\text{MI}[p(x,z)]$. Formally the MI has the following definition:

$$ \text{MI}[p(x,z)] := \E_{p(x, z)} \log \frac{p(x, z)}{p(x) p(z)} = \E_{p(x, z)} \log \frac{p(x \mid z)}{p(x)} $$

Having such a nice information-measuring interpretation, MI is a natural objective and/or metric in many problems in Machine Learning. One particular application is evaluatiuon of encoder-decoder-like architectures as MI quantifies amount of information contained in the code. In particular, in Variational Autoencoders a good decoder should have high $\text{MI}[p(x,z)]$, meaning the code is very useful for generations. A good encoder though... has to keep balance between providing enough information to the decoder, while not deviating too much from the prior by, for example, encoding redundant / unnecessary information. 1

MI Estimation

In general estimating the MI is intractable as it requires knowing the log marginal density $\log p(x)$ or the intractable log posterior $\log p(z|x)$. However, there are many efficient variational bounds which can be employed to give tractable lower or upper bounds on the MI. Many existing bounds are reviewed in the On Variational Bounds of Mutual Information paper.

It is important to take into account what we know about the joint distribution of $x$ and $z$. We'll consider several nested "layers" of decreasing complexity:

  1. Blackbox case: Distributions $p(x, z)$ we can only sample from, but don't know any densities. This way we can form Monte Carlo estimates after some learning and surprisingly we can give some lower bounds already in this case.
  2. Known conditional case: Distributions $p(x, z)$ we can sample from and know one conditional distribution, say, $p(x|z)$.
  3. Known marginal case: Distributions $p(x, z)$ we can sample from and know one marginal distribution, say, $p(z)$.
  4. Known joint case: Distributions $p(x, z)$ we can sample from and know both a marginal and a conditional distributions, say, $p(z)$ and $p(x|z)$.
  5. Known everything case: Distributions $p(x, z)$ which we can sample from and know all conditionals and marginals. This is a trivial case and doesn't require any bounds. The MI can be estimated using Monte Carlo directly, so we'll omit it from the discussion.

Bounds based on Self-Normalized Importance Sampling

Self-Normalized Importance Sampling (SNIS) has been shown (see my previous posts) to give both lower and upper bounds on the marginal log-likelihood:

$$ \begin{align*} \text{IWAE}:& \quad\quad\quad \log p(x) \ge \E_{q(z_{1:K}|x)} \log \frac{1}{K} \sum_{k=1}^K \frac{p(x,z_k)}{q(z_k|x)} \\ \text{IWHVI}:& \quad\quad\quad \log p(x) \le \E_{p(z_{0}|x)} \E_{q(z_{1:K}|x)} \log \frac{1}{K+1} \sum_{k=0}^K \frac{p(x,z_k)}{q(z_k|x)} \end{align*} $$

We use such bounds to give sandwich bound on the intractable entropy term in the MI. Another useful insight is that

$$ \omega(z_{0:K}|x) := \frac{\hat\rho(x, z_0) \tau(z_{1:K}|x)}{\frac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho(x, z_k)}{\tau(z_k|x)}} $$

is a distribution (a valid pdf, to be precise) for almost any (the only condition is the same as in the standard importance sampling) unnormalized distribution $\hat\rho(x, z)$ (a joint distribution over $z$ and $x$) and a normalized distribution $\tau(z|x)$ (a distribution over $z$ possibly conditioned on $x$). The fact that $\omega(z_{0:K}|x)$ is a valid distribution allows us to consider the following KL divergence:

$$ \begin{align*} 0 \le \text{KL}(p(x, z_0) \tau(z_{1:K}|x) \mid\mid p(x) \omega(z_{0:K}|x)) = \E_{p(x, z_0)} \E_{\tau(z_{1:K}|x)} \log \frac{p(x, z_0) \tau(z_{1:K}|x)}{p(x) \omega(z_{0:K}|x)} \end{align*} $$

Which gives the following lower bound on the MI: $$ \begin{align*} \E_{p(x, z_0)} \E_{\tau(z_{1:K}|x)} \log \frac{p(x | z_0)}{p(x)} & \ge \E_{p(x, z_0)} \E_{\tau(z_{1:K}|x)} \log \frac{\omega(z_{0:K}|x)}{p(z_0) \tau(z_{1:K}|x)} \\ & = \E_{p(x, z_0)} \E_{\tau(z_{1:K}|x)} \log \frac{\hat\rho(x, z_0)}{\frac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho(x, z_k)}{\tau(z_k|x)}} - \E_{p(z)} \log p(z) \end{align*} $$

Equivalently, by reparametrizing the bound in terms of $\hat\varrho(x|z) = \hat\rho(x, z) / p(z)$ we have $$ \begin{align*} \E_{p(x, z_0)} \E_{\tau(z_{1:K}|x)} \log \frac{p(x | z_0)}{p(x)} \ge \E_{p(x, z_0)} \E_{\tau(z_{1:K}|x)} \log \frac{\hat\varrho(x | z_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\varrho(x | z_k) \frac{p(z_k)}{\tau(z_k|x)}} \end{align*} $$

These lower bounds work for any $\rho$ and $\varrho$, and the optimal choices are $p(x, z)$ and $p(x|z)$, correspondingly.

Known Joint

First, we'll consider the easiest case -- when we know the joint distribution in the form of a prior + conditional. A prominent example of this class is VAEs decoders, which is defined by some prior in the latent space $p(z)$ and a decoder $p_\theta(x|z)$ that uses a neural network to generate a distribution over observations $x$ for a particular $z$. Computing the MI of the decoder is arguably a natural way to measure the extent of posterior collapse2, since MI can be expressed in the following form, which essentially measures the true posterior's deviation from the prior:

$$ \text{MI}[p(x, z)] := \E_{p(x, z)} \log \frac{p(z|x)}{p(z)} = \E_{p(x)} D_{KL}(p(z|x) \mid\mid p(z)) $$

However the MI as introduced above requires knowing marginal $p(x)$, which is intractable. Luckily, the Multisample Variational Bounds allow us to give efficient variational sandwich bounds on MI (where $q_\phi(z|x)$ is an encoder with known density4):

$$ \E_{\substack{p_\theta(x, z_0) \\ q_\phi(z_{1:K}|x)} } \log \frac{p_\theta(x|z_0)}{\frac{1}{K+1} \sum_{k=0}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)}} \le \text{MI}[p_\theta(x, z)] %= \mathbb{E}_{p_\theta(x, z)} \left[ \log p_\theta(x|z) - \log p_\theta(x) \right] \le \E_{\substack{p_\theta(x, z_0) \\ q_\phi(z_{1:K}|x)} } \log \frac{p_\theta(x|z_0)}{\frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)}} $$

Known Conditional

However, we might not have any of marginal densities in the closed form. For example, this is the case if we want to estimate the amount of information the encoder $q_\phi(z|x)$ puts into the code $z$. Since it only defines the conditional distribution, we pair it with some data-generating process $p(x)$ which defines the population we'd like to evaluate the encoder over.

Direct application of the aforementioned bounds leads to (where $\tau_\eta(x_k|z)$ is our variational inverse distribution) $$ \E_{\substack{p(x_0) \\ q_\phi(z | x_0) \\ \tau_\eta(x_{1:K}|z)}} \!\!\! \log \frac{q_\phi(z|x_0)}{\frac{1}{K+1} \sum\limits_{k=0}^K \frac{q_\phi(z|x_k) p(x_k)}{\tau_\eta(x_k|z)}} \le \text{MI}[q_\phi(z | x) p(x)] \le \!\!\! \E_{\substack{p(x_0) \\ q_\phi(z | x_0) \\ \tau_\eta(x_{1:K}|z)}} \!\!\! \log \frac{q_\phi(z|x_0)}{\frac{1}{K} \sum\limits_{k=1}^K \frac{q_\phi(z|x_k) p(x_k)}{\tau_\eta(x_k|z)}} $$

However, we might not have an access to the density $p(x)$. In this case one can resort to SIVI-like bounds by setting $\tau_\eta(x|z) = p(x)$ and arrive to the following bounds: $$ \E_{\substack{p(x_{0:K}) \\ q_\phi(z | x_0)}} \log \frac{q_\phi(z|x_0)}{\frac{1}{K+1} \sum\limits_{k=0}^K q_\phi(z|x_k)} \le \text{MI}[q_\phi(z | x) p(x)] \le \E_{\substack{p(x_{0:K}) \\ q_\phi(z | x_0)}} \log \frac{q_\phi(z|x_0)}{\frac{1}{K} \sum\limits_{k=1}^K q_\phi(z|x_k)} $$

These bounds are known as special case of InfoNCE bounds and are much worse due to uninformed proposal $\tau$.

Known Prior

Sometimes we use complex implicit models as decoders and don't have closed-form densities for $p(x|z)$. The most popular instance of such models is Generative Adversarial Networks, which is similar to VAE with an exception of not having a well-defined decoder's density $p(x|z)$ (but the prior $p(z)$ is typically simple and known). In some sense, this density is degenerate: $p(x|z) = \delta(x - f(z))$ where $f$ is generator's neural network. Unfortunately, we cannot use such density in the IWHVI bounds. Are we doomed then? Turns out, not quite so. Even in this case it's still possible to give an efficient multisample variational lower bound: $$ \text{MI}[p(x, z)] \ge \E_{p(x, z_0)} \E_{q_\phi(z_{1:K}|x)} \log \frac{\hat\rho_\eta(x|z_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\rho_\eta(x|z_k) \frac{p(z_k)}{q_\phi(z_k|x)}} $$

Where $\hat\rho_\eta(x|z) = \exp(h_\eta(x|z))$ is any non-normalized energy-based model that essentially estimates the unknown density $p(x|z)$.

I am not aware of any good upper bounds and if they are possible. I would think the answer is negative due to hardness of upper-bounding the crossentropy in the black-box case.

Known Nothing

Staying in the realm of implicit models, let's assume we trained an implicit inference model (think of GANs with encoders) and would like estimate the MI much like in the case of VAE's encoder in the "known conditional" section. Denoting our inference model $q(z|x)$ and the data-generating process $p(x)$ (both densities are unknown to us, but we can sample from them), we can adapt the previous section's lower bound by choosing the proposal $\tau_\eta(x|z) = p(x)$ $$ \text{MI}[q(z | x) p(x)] \ge \E_{p(x_{0:K})} \E_{q(z|x_0)} \log \frac{\hat\rho_\eta(z|x_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\rho_\eta(z|x_k)} $$

Where $\hat\rho_\eta(z|x) = \exp(h_\eta(z|x))$ is again a non-normalized energy-based model that estimates the unknown density $q(z|x)$.

While convenient in its wide applicability, this bound is known to be very loose in cases when the true MI is high. We'll discuss drawbacks and limitations in the next post on the topic.

Bounds based on Annealed Importance Sampling

SNIS is not the only way to obtain variational sandwich bounds on log marginal likelihood. Another widely known and powerful approach is Annealed Importance Sampling (AIS)

AIS uses two distributions, called forward and backward: $$ q_\rightarrow(z_{1:T} | x) = q(z_1|x) \mathcal{T}_2(z_2 \mid z_1, x) \cdots \mathcal{T}_{T}(z_{T} \mid z_{T-1}, x) $$ $$ q_\leftarrow(z_{T:1} | x) = p(z_{T}|x) \mathcal{T}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{T}_{2}(z_{1} \mid z_{2}, x) $$ $$ q_\leftarrow(x, z_{T:1}) = p(x, z_{T}) \mathcal{T}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{T}_{2}(z_{1} \mid z_{2}, x) $$

Where $\mathcal{T}_t$ is a transition operator that is designed to be invariant to $p_t(z|x) \propto q(z|x)^{1-\beta_t} p(x, z)^{\beta_t}$ and $\beta_{1:T+1}$ is a monotonically increasing sequence s.t. $\beta_1 = 0$ and $\beta_{T+1} = 1$. That is, in the forward distribution $q_\rightarrow(z_{1:T} | x)$ one starts with a sample $z_1$ from some proposal $q(z|x)$ and then transforms it into a sample from $p_2(z|x)$ using the $\mathcal{T}_t$ transition operator (typically a MCMC kernel). The sample $z_2$ is then analogously transformed into $z_3$ amd so on. The backward distribution $q_\rightarrow(z_{T:1} | x)$ is similar except it starts with the true posterior sample $z_T \sim p(z|x)$ and then sequentially transforms it into a sample from the proposal $z_1 \sim q(z|x)$.

Then one defines the importance weight

$$ \begin{align*} w(z_{1:T} \mid x) &= \frac{q_\leftarrow(z_{T:1} | x)}{q_\rightarrow(z_{1:T} | x)} = \frac{\hat{p}_2(z_1 \mid x)} {q(z_1|x)} \frac{\hat{p}_3(z_2 \mid x)} {\hat{p}_2(z_2 \mid x)} \cdots \frac{p(x, z_{T})} {\hat{p}_{T}(z_{T} \mid x)} \\ &= % \left( \tfrac{p(x, z)}{q(z|x)} \right)^{\beta_t} q(z|x) \frac{ \left( \tfrac{p(x, z_1)}{q(z_1|x)} \right)^{\beta_2} q(z_1|x) }{q(z_1|x)} \frac{ \left( \tfrac{p(x, z_2)}{q(z_2|x)} \right)^{\beta_3} q(z_2|x) }{ \left( \tfrac{p(x, z_2)}{q(z_2|x)} \right)^{\beta_2} q(z_2|x) } \cdots \frac{ \left( \tfrac{p(x, z_{T})}{q(z_{T}|x)} \right)^{\beta_{T+1}} q(z_{T}|x) }{ \left( \tfrac{p(x, z_{T})}{q(z_{T}|x)} \right)^{\beta_{T}} q(z_{T}|x) } \\ &= \left( \tfrac{p(x, z_1)}{q(z_1|x)} \right)^{\beta_2 - \beta_1} \left( \tfrac{p(x, z_2)}{q(z_2|x)} \right)^{\beta_3 - \beta_2} \cdots \left(\tfrac{p(x, z_{T})}{q(z_{T}|x)} \right)^{\beta_{T+1} - \beta_{T}} \\ \end{align*} $$ Where the second identity is due to $\mathcal{T}_t$ satisfying the detailed balance equation. Then one can show that $$ \E_{q_\rightarrow(z_{1:T} | x)} w(z_{1:T} \mid x) = p(x) \quad\Rightarrow\quad \E_{q_\rightarrow(z_{1:T} | x)} \log w(z_{1:T} \mid x) \le \log p(x), $$ $$ \E_{q_\leftarrow(z_{T:1} | x)} \frac{1}{w(z_{1:T} \mid x)} = \frac{1}{p(x)} \quad\Rightarrow\quad \E_{q_\leftarrow(z_{T:1} | x)} \log w(z_{1:T} \mid x) \ge \log p(x) $$

Which gives us another set of sandwich bounds, which we can use to sandwich bound the MI: $$ \boxed{ \E_{q_\leftarrow(x, z_{T:1} | x)} % \E_{p_\theta(x, z_T)} % \E_{ \mathcal{T}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{T}_{2}(z_{1} \mid z_{2}, x) } \log \frac{p_\theta(x|z_T)}{ w(z_{1:T} \mid x) } \le \text{MI}[p_\theta(x, z)] \le \E_{p_\theta(x, z_0)} \E_{q_\rightarrow(z_{1:T} | x)} \log \frac{p_\theta(x|z_0)}{ w(z_{1:T} \mid x) } } $$

One can also come up with a "decoder-free" version of the bound in a similar fashion to what we've done above. First, introduce the following distributions: $$ \gamma_\rightarrow(z_{1:T} | x) = q(z_1|x) \mathcal{K}_2(z_2 \mid z_1, x) \cdots \mathcal{K}_{T}(z_{T} \mid z_{T-1}, x) $$ $$ \gamma_\leftarrow(x, z_{T:1}) = p(x, z_{T}) \mathcal{K}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{K}_{2}(z_{1} \mid z_{2}, x) $$ Where $\mathcal{K}_t$ is now tailored to $\kappa_t(z|x) \propto q(z|x)^{1-\beta_t} \left( \hat\varrho(x|z) p(z) \right)^{\beta_t}$ for certain unnormalized $\hat\varrho(x|z)$

Now consider $$ \begin{align*} 0 & \le \text{KL}\left( \gamma_\leftarrow(x, z_{T:1}) \mid\mid p(x) \gamma_\rightarrow(z_{1:T} | x) \right) = \mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})} \log \frac{\gamma_\leftarrow(x, z_{T:1})}{p(x) \gamma_\rightarrow(z_{1:T} | x)} \\ & = \mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})} \log \left[ \frac{p(x, z_{T})}{p(x) q(z_1|x) } \frac{\mathcal{K}_{2}(z_{1} \mid z_{2}, x)}{\mathcal{K}_2(z_2 \mid z_1, x)} \cdots \frac{ \mathcal{K}_{T}(z_{T-1} \mid z_{T}, x) }{\mathcal{K}_{T}(z_{T} \mid z_{T-1}, x)} \right] \\ & = \mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})} \log \left[ \frac{p(x, z_{T})}{p(x) q(z_1|x) } \frac{\kappa_2(z_1|x)}{\kappa_2(z_2|x)} \cdots \frac{ \kappa_T(z_{T-1}|x) }{\kappa_T(z_{T}|x)} \right] \\ & = % q(z|x) \left( \hat\varrho(x|z) \frac{p(z)}{q(z|x)} \right)^{\beta_t} \mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})} \log \left[ \tfrac{p(x, z_{T})}{p(x) q(z_1|x) } \tfrac{ q(z_1|x) \left( \frac{\hat\varrho(x|z_1) p(z_1)}{q(z_1|x)} \right)^{\beta_2} }{ q(z_2|x) \left( \frac{\hat\varrho(x|z_2) p(z_2)}{q(z_2|x)} \right)^{\beta_2} } \cdots \tfrac{ q(z_{T-1}|x) \left( \frac{\hat\varrho(x|z_{T-1}) p(z_{T-1})}{q(z_{T-1}|x)} \right)^{\beta_T} }{ q(z_T|x) \left( \frac{\hat\varrho(x|z_T) p(z_T)}{q(z_T|x)} \right)^{\beta_T} } \right] \\ & = \mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})} \log \left[ \frac{p(x, z_{T})}{p(x)} \frac{1}{p(z_T) \hat\varrho(x|z_T)} \prod_{t=1}^T \left( \frac{\hat\varrho(x|z_t) p(z_t)}{q(z_t|x)} \right)^{\beta_{t+1}-\beta_t} \right] \end{align*} $$ Hence $$ \text{MI}[p(x, z)] \ge \mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})} \log \frac{\hat\varrho(x|z_T)}{\prod_{t=1}^T \left( \frac{\hat\varrho(x|z_t) p(z_t)}{q(z_t|x)} \right)^{\beta_{t+1}-\beta_t}} $$

Now we can again reparametrize this formula in terms of $\hat\rho(x, z) = \hat\varrho(x|z) p(z)$:

$$ \text{MI}[p(x, z)] \ge \mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})} \log \frac{\hat\rho(x, z_T)}{\prod_{t=1}^T \left( \frac{\hat\rho(x, z_t)}{q(z_t|x)} \right)^{\beta_{t+1}-\beta_t}} - \mathbb{E}_{p(z)} \log p(z) $$

It's tempting to simply put $q(z|x) = p(z)$ to obtain a blackbox AIS-based analogue of the InfoNCE bound, however notice that $\gamma_\leftarrow(x, z_{T:1})$ relies on an MCMC kernel that gradually transforms a sample $z_T \sim p(z_T|x)$ into $z_1 \sim p(z_1)$ and thus needs to know this density. Thus I don't think one can use AIS in the blackbox mode.

Finally, note while the bound is valid for any $\hat\rho$ and $\hat\varrho$, it's not quite differentiable w.r.t. their parameters as any proper MCMC method would require an accept-reject step which is not differentiable. Thus if one seeks to learn $\hat\rho$ or $\hat\varrho$, an alternative objective should be used. Luckily, the SNIS-based lower bound with small $K$ would work just fine.


I presented two different ways to give bounds on the MI in cases of variable complexity. The SNIS-based approach seem to be somewhat novel (the special case of InfoNCE has been already known), and is applicable in many different problems. The AIS-based one is based on well-known sandwich bounds on the log marginal likelihoods, but I haven't seen it being applied to the problem of MI estimation. The reason might be that it's more restrictive that the SNIS-based estimations: AIS only works for continuous variables, requires complicated MCMC to function and does not seem to allow blackbox estimators. On the positive side of things, AIS-based estimator should perform much better in high-dimensional problems with large MI, especially if one uses gradient-based kernels like HMC.

Next, I'll share some of my thoughts on drawbacks of these (and some other) bounds, in particular in light of the notorious "Formal Limitations" paper.

  1. See ELBO surgery: yet another way to carve up the variational evidence lower bound 

  2. Oftentimes $D_{KL}(q_\phi(z|x) \mid\mid p(z))$ is used to measure the extent of the so called posterior collapse. One can argue this is an indirect metric as it does not use decoder at all and relies on the encoder's ability to approximate the true posterior better than the prior. Moreover, high $D_{KL}(q_\phi(z|x) \mid\mid p(z))$ is likely to be an indicator of poor encoder $q(z|x)$, whereas the Mutual Information is monotonic in decoder's quality. 

  3. It is possible to give such bounds in case of $q_\phi(z|x)$ being a Neural Sampler. The upper bound is essentially given by IWAE-like tightening, while deriving a lower bound is much more complicated and will be presented in an extended version of the paper. 

  4. As I have argued in the IWHVI paper, it might be beneficial to fit two different encoders for lower and upper bounds, correspondingly. 

comments powered by Disqus