Random notes mostly on Machine Learning

Thoughts on Mutual Information: Alternative Dependency Measures

This posts finishes the discussion started in the Thoughts on Mutual Information: More Estimators with a consideration of alternatives to the Mutual Information.

Mutual Information

Let’s step out a bit and take a critical look at the MI. One of its equivalent definitions says that it’s a KL-divergence between the joint distribution and the product of marginals: \[ \text{MI}[p(x, z)] = D_{KL}(p(x, z) \mid\mid p(x) p(z)) \]

Indeed, if the random variables \(X\) and \(Z\) are independent, then the joint distribution \(p(x, z)\) factorizes as \(p(x) p(z)\), and the KL (or any other divergence or distance between probability distributions) is equal to zero. Conversely, the more \(X\) and \(Z\) are dependent, the further the joint \(p(x, z)\) deviates from the product of marginals \(p(x) p(z)\).

But why this particular choice of divergence?

Why not Jeffreys divergence, Jensen-Shannon divergence, Total Variation distance or Wasserstein distance?

The answer to this question lies in the entropic form of the MI: \[ \text{MI}[p(x, z)] = D_{KL}(p(x, z) \mid\mid p(x) p(z)) = \mathbb{H}[p(x)] - \E_{p(z)} \mathbb{H}[p(x|z)] \]

The Mutual Information is equal to average reduction in entropy (a measure of uncertainty) of \(X\) when we know \(Z\). Such information-theoretic interpretation is the main reason the MI is so widespread. However, there’s a major issue when \(X\) and \(Z\) are continuous: these entropies become differential ones, and the differential entropy does not enjoy the same uncertainty-measuring interpretation as the discrete one does.

One particular issue with the continuous Mutual Information is the following one: if \(\text{Pr}[X = Z] = 1\), then the MI attains its maximal value. In the discrete case this maximal value is equal to the entropy of \(X\) and finite, but in the continuous case it’s equal to \(+\infty\) 1. Moreover, imagine \(X\) and \(Z\) are \(N\)-dimensional random vectors s.t. \(\text{Pr}(X_1 = Z_1) = 1\) and the rest components are all independent random variables. Then, it’s easy to show that the MI \(I(X, Z) = +\infty\) regardless of \(N\)! So if \(N\) is in billions these vectors are mostly independent, but one pesky component ruined it all, and we ended up with an infinite mutual “information”.

I hope this convinced you there’s no information-theoretic interpretation in the continuous case to hold on to the particular choice of divergence between \(p(x, z)\) and \(p(x) p(z)\), which means we’re free to explore alternatives…


KL divergence is a special case of \(f\)-divergences given by \(f(t) = t \log t\). However, other choices are perfectly legal, too:

  1. Jensen-Shannon divergence corresponds to \(f(t) = t \log t - (t+1) \log \tfrac{t+1}{2}\).
  2. Total Variation distance corresponds to \(f(t) = |t-1|\).
  3. Jeffreys divergence is given by \(f(t) = \tfrac{t-1}{2} \log t\).
  4. Reverse KL divergence is given by \(f(t) = -\log t\).

So, one can consider \(f\)-Mutual Information defined as \[ I_f(X, Z) := D_f(p(x, z) \mid\mid p(x) p(z)) = \E_{p(x) p(z)} f\left(\frac{p(x, z)}{p(x) p(z)}\right) \]

In general, however, KL, Reverse KL and combinations thereof are the only \(f\)-divergences that are additive for independent random variables: if \(X_1 \perp X_2\) under both \(p(x)\) and \(q(x)\), then \[\text{KL}(p(x) \mid\mid q(x)) = \text{KL}(p(x_1) \mid\mid q(x_1)) + \text{KL}(p(x_2) \mid\mid q(x_2))\] And thus for \(X_1 \perp X_2\) and \(Z_1 \perp Z_2\) \[ I_f(X, Z) = I_f(X_1, Z_1) + I_f(X_2, Z_2) \Leftrightarrow \text{$f$ is a combination of KLs} \] Is such additivity important, though? Imagine having a sample set of independent objects \(X_1, \dots, X_N\) used to extract corresponding representations \(Z_1, \dots, Z_N\). In general with \(f\)-MI you’re not allowed to use stochastic optimization / minibatching to work with \(I_f(X, Z)\). This is counter-intuitive and not something we’d expect from a measure of information.

That said, there some thing to keep in mind:

  1. In practice you probably can use \(\sum_{n=1}^N I_f(X_n, Z_n)\) instead of \(I_f(X, Z)\) without having to suffer any consequences.
  2. In some special cases such additivity does hold. These are cases of KL divergence, Reverse KL divergence and any combination thereof.

Lautum Information

Palomar and Verdu introduced the Lautum Information (I particularly liked their naming: Lautum is Mutual backwards): an analogue of the Mutual Information with KL’s arguments swapped:

\[ \text{LI}[p(x, z)] = D_{KL}(p(x) p(z) \mid\mid p(x, z)) \]

It an be equivalently rewritten as

\[ \begin{align*} \text{LI}[p(x, z)] &= \E_{p(x) p(z)} \log \frac{p(x) p(z)}{p(x, z)} = \E_{p(x) p(z)} \log \frac{p(z)}{p(z|x)} \\ &= -\E_{p(x) p(z)} \log p(z|x) - \mathbb{H}[p(z)] \end{align*} \]

Notice that in general the first term is not an entropy, but rather a cross-entropy. Unfortunately, this cross-entropy term lacks intuitive information-theoretic interpretation – a distinctive feature of the Mutual Information.

Another disadvantage of the Lautum Information is that even for discrete random variables it’s infinite when \(X = Z\). Since one is sampling from the product of marginals \(p(x) p(z)\), you’ll inevitably have some probability mass in the area where \(x \not= z\), but \(p(x, z)\) will be exactly zero in such regions, hence logarithm’s argument will be infinite. However, for continuous random variables, even the standard Mutual Information will be infinite in the case of a deterministic invertible dependency.

But how do we estimate the LI? First, the good news is that its only entropy term comes with a minus sign, hence if you seek to lower bound the LI, then the formal limitations theorem does not apply. Unfortunately, I’m not aware of any good black-box bounds on the cross-entropy, so we’ll have to assume at least one conditional to be known, say, \(p(x|z)\). For the other term we can use any of the plethora of bounds on the log marginal likelihood: \[ \begin{align*} \text{LI}[p(x, z)] &= -\E_{p(x) p(z)} \log p(x|z) + \E_{p(x)} \log p(x) \\ &= -\E_{p(x) p(z)} \log p(x|z) + \E_{p(x)} \log q(x) + \E_{p(x)} \log \frac{p(x)}{q(x)} \\ &= -\E_{p(x) p(z)} \log p(x|z) + \E_{p(x)} \log q(x) + D_\text{KL}(p(x) \mid\mid q(x) ) \\ &\ge \E_{p(x) p(z)} \log \frac{q(x)}{p(x|z)} \end{align*} \]

Where \(q(x)\) is any (variational) distribution with the same support as \(p(x)\). However, notice that we’ve already assumed \(p(x|z)\) to be known. With an additional assumption of being able to sample from this conditional we can use SIVI to give an (non-black-box) lower bound on the entropy of \(p(x)\): \[ \begin{align*} \text{LI}[p(x, z)] &= \E_{p(x)} \log p(x) - \E_{p(x) p(z)} \log p(x|z) \\ &\le \E_{p(x, z_0)} \E_{p(z_{1:K})} \log \left( \frac{1}{K+1} \sum_{k=0}^K p(x|z_k) \right) - \E_{p(x) p(z)} \log p(x|z) \\ &= \E_{p(x, z_0)} \E_{p(z_{1:K})} \left[ \log \left( \frac{1}{K+1} \sum_{k=0}^K p(x|z_k) \right) - \frac{1}{K} \sum_{k=1}^K \log p(x|z_k) \right] \\ \end{align*} \]

Moreover, if the marginal distribution \(p(z)\) is known, IWHVI provides a better estimate at the cost of introducing a variational distribution \(q(z|x)\): \[ \begin{align*} \text{LI}[p(x, z)] &= \E_{p(x)} \log p(x) - \E_{p(x) p(z)} \log p(x|z) \\ &\le \E_{p(x, z_0)} \E_{q(z_{1:K}|x)} \log \left( \frac{1}{K+1} \sum_{k=0}^K \frac{p(x, z_k)}{q(z_k|x)} \right) - \E_{p(x) p(z)} \log p(x|z) \end{align*} \]

Analogously, one can use IWAE bounds to arrive at the following lower bound on the LI: \[ \begin{align*} \text{LI}[p(x, z)] &= \E_{p(x)} \log p(x) + \E_{p(x) p(z)} \log p(x|z) \\ &\ge \E_{p(x)} \E_{q(z_{1:K}|x)} \log \left( \frac{1}{K} \sum_{k=1}^K \frac{p(x, z_k)}{q(z_k|x)} \right) - \E_{p(x) p(z)} \log p(x|z) \end{align*} \]

Wasserstein Distance

An interesting alternative to both Forward and Reverse KLs is the Wasserstein Distance aka Kantorovich–Rubinstein distance aka optimal transport distance. Formally, Wasserstein-\(p\) metric \(W_p\) defined as \[ W_p(p(x), q(x)) := \left( \inf_{\gamma \in \Gamma(p, q)} \E_{\gamma(x, y)} \|x-y\|_p^p \right)^{1/p} \]

Where \(\Gamma(p, q)\) is a set of all possible joint distributions over \(x\) and \(y\) s.t. \(p(x)\) and \(q(y)\) are its respective marginals: \[ \gamma \in \Gamma(p, q) \Leftrightarrow \int_{\text{dom}(y)} \gamma(x, y) dy = p(x), \quad \int_{\text{dom}(x)} \gamma(x, y) dx = q(y) \]

In particular, we’ll be considering Wasserstein-1 distance \(W_1\): \[ W_1(p(x), q(x)) := \inf_{\gamma \in \Gamma(p, q)} \E_{\gamma(x, y)} \|x-y\|_1 \]

The Wasserstein-1 distance lies at the heart of the Wasserstein GAN, and it was this paper that suggested to use Kantorovich–Rubinstein duality to estimate the distance in practice: \[ W_1(p(x), q(x)) \ge \E_{p(x)} f(x) - \E_{q(x)} f(x) \] Where \(f\) is any 1-Lipschitz function. It also seems to be additive for independent random variables – a nice property to have.

Now we can use this tractable lower bound to estimate the Wasserstein Dependency Measure \(W_1(p(x, z), p(x) p(z))\) – a Wasserstein analogue of the Mutual Information2: \[ W_1(p(x, z), p(x) p(z)) \ge \E_{p(x, z)} f(x, z) - \E_{p(x) p(z)} f(x, z) \]

You can notice that this lower bound is similar to the Nguyen Wainwright Jordan lower bound on KL. Unfortunately, it’s not known whether this bound is efficient or if it also exhibits large variance. Thorough theoretical analysis of the Kantorovich–Rubinstein lower bound is an interesting research question.


The Mutual Information is far from being the only dependency measure, yet it’s the most intuitive one, with the intuition coming from the information theory. However, as I argued here, this nice interpretation goes out of the window once you introduce continuous random variables, so no need to stick to the particular choice of the MI, especially given that there’re lots of alternative dependency measures. With the Formal Limitations paper preventing us from having practical black-box lower bounds on the MI, I believe more and more researchers will study alternative dependence measures, and we already see some pioneering work, such as Wasserstein Dependency Measure.

  1. This is, of course, due to the fact that a real number requires an infinite amount of bits to be written exactly, allowing one to indeed store an infinite amount of information in a truly real number.

  2. Notably, authors of the paper did not name their measure something like Wasserstein Mutual Information. Technically speaking, only the KL-based MI can be called something-information since, as we’ve discussed already, it’s the only dependency measure that has information-theoretic interpretation. In that sense, Lautum Information should have been named differently.

comments powered by Disqus