Random notes mostly on Machine Learning

Reciprocal Convexity to reverse the Jensen Inequality

Jensen’s inequality is a powerful tool often used in mathematical derivations and analyses. It states that for a convex function \(f(x)\) and an arbitrary random variable \(X\) we have the following upper bound: \[ f\left(\E X\right) \le \E f\left(X\right) \]

However, oftentimes we want the inequality to work in the other direction, to give a lower bound. In this post I’ll outline one possible approach to this.

The Trick

The basic idea is very simple: let’s turn our convex function into a concave function. First, define

\[ \hat{f}(x) = f\left(\tfrac{1}{x}\right) \]

As defined by Merkle, a function \(h(x)\) is called reciprocally convex if \(h(x)\) is concave and \(\hat{h}(x) = h(1/x)\) is convex. For the sake of this discussion we’ll assume \(f(x)\) is reciprocally concave, that is, of course, \(-f(x)\) is reciprocally convex.

Next, we’ll need an unbiased estimator \(Y\) of the reciprocal of the mean \(\E X\), that is, \(Y\) should satisfy the following:

\[ \E Y = \frac{1}{\E X} \]

Now, the rest is simple algebra and the standard Jensen’s inequality (remember \(\hat{f}\) is concave by definition): \[ f\left(\E X\right) = \hat{f}\left(\frac{1}{\E X}\right) = \hat{f}\left(\E Y\right) \ge \E \hat{f}\left(Y\right) = \E f\left(\frac{1}{Y}\right) \tag{1} \]


This trick is actually the reason why we can have both upper and lower bounds on the log marginal likelihood in latent variable models. Indeed, consider the following example:

\[ f(x) := -\log(x), \quad X := p(x \mid Z), \quad Z \sim p(z) \]

This is the standard Variational Inference setup. Putting it all together, we’d like to give bounds on

\[ -\log \left( \E_{Z \sim p(z)} p(x|Z) \right) = -\log p(x) \]

Normally, in VI we use the standard Jensen’s Inequality to obtain an upper bound on this negative log-likelihood, and all is good. However, sometimes we need lower bounds on the same quantity. This is where the framework above comes to the rescue.

First, it’s easy to see that we’re very lucky – \(f(x)\) is indeed reciprocally concave: \(-\log(x)\) is convex, and \(-\log\tfrac{1}{x} = \log(x)\) is concave.

Next, we need an unbiased estimator \(Y\) of the inverse mean of \(X\), that is, an unbiased estimator of \(1/p(x)\). Such estimator can be given this way:

\[ \frac{1}{p(x)} = \int \frac{q(z)}{p(x)} dz = \int \frac{q(z) p(z|x)}{p(x) p(z | x)} dz = \E_{p(z|x)} \frac{q(z)}{p(x, z)} \]

Where \(q(z)\) is an arbitrary distribution. Thus, the estimator is \(Y\) generated by r.v. \(Z\): \[ Y := \frac{q(Z)}{p(x, Z)}, \quad Z \sim p(z|x) \]

Now, putting these into (1) we obtain: \[ -\log p(x) \ge -\E_{p(z|x)} \log \frac{p(x, z)}{q(z)} \] Or, equivalently, \[ \log p(x) \le \E_{p(z|x)} \log \frac{p(x, z)}{q(z)} \] By the way, for comparison, here’s the classical lower bound obtained through the standard Jensen’s Inequality. Curiously, the only difference is where the random variables \(z\) are coming from: \[ \log p(x) \ge \E_{q(z)} \log \frac{p(x, z)}{q(z)} \]


Why limit ourselves to a particular \(\hat{f} = f \circ (1/x)\)? One can consider other invertible functions \(g(x)\) instead of the \(1/x\). Here’s the recipe:

  • Define \(f^{[g]}(x) = f(g(x))\)
  • First, we need \(f^{[g]}(x)\) to be concave
  • Second, we need an unbiased estimator \(Y\) of \(g^{-1}(\E X)\)

This leads to a generalization of (1): \[ f\left(\E X\right) = f^{[g]}\left( g^{-1}(\E X) \right) = f^{[g]}\left( \E Y \right) \ge \E f^{[g]}\left( Y \right) = \E f\left( g(Y) \right) \]


This trick is simple, and perhaps obvious even without any fancy words such as reciprocal convexity. Moreover, it has its limitations: you either need to get lucky with \(f(x)\) being reciprocally concave, or need to find an invertible \(g(x)\) such that \(f \circ g\) is concave. But even that’s not enough, as you also need to construct an unbiased estimator \(Y\), and if you fancy practical applications, efficiency of the resulting bound will heavily depend on the quality of this estimator.

Nevertheless, I believe this is an interesting idea and it might prove itself useful in various analyses and derivations.

Tagged as math
comments powered by Disqus