Many tasks of machine learning can be posed as optimization problems. One comes up with a parametric model, defines a loss function, and then minimizes it in order to learn optimal parameters. One very powerful tool of optimization theory is the use of smooth (differentiable) functions: those that can be locally approximated with a linear functions. We all surely know how to differentiate a function, but often it’s more convenient to perform all the derivations in matrix form, since many computational packages like numpy or matlab are optimized for vectorized expressions.

In this post I want to outline the general idea of how one can calculate derivatives in vector and matrix spaces (but the idea is general enough to be applied to other algebraic structures).

### The Gradient

What is the gradient? Recall that smooth function (for now we’ll be considering scalar functions only) \(f : \mathcal{X} \to \mathbb{R}\) is one which is approximately linear within some neighborhood of a given point. That means \(f(x + dx) - f(x) = \langle g(x), dx \rangle\) (think of \(dx\) as of a very small perturbation of \(x\)) where \(\langle \cdot, \cdot \rangle\) denotes the dot product in the space \(\mathcal{X}\), and \(g(x)\) is called the **gradient** of \(f(x)\) at the point \(x\) (we’ll be using the nabla notation from now on: \(g(x) = \nabla f(x)\)).

For example, for functions of one variable (\(\mathcal{X} = \mathbb{R}\)) we have \(\langle a, b \rangle = a b\), for functions of several variables (\(\mathcal{X} = \mathbb{R}^n\)) it’s the usual dot product \(\langle a, b \rangle = a^T b\), and for functions of matrices (\(\mathcal{X} = \mathbb{R}^{n \times m}\)) it generalizes vector dot product: \(\langle A, B \rangle = \text{Tr}(A^T B)\).

Now let’s introduce the notion of **differential** \(df(x)\) to be a perturbation of the function \(f\) if we perturb \(x\) by \(dx\), which we assume to be infinitesimally small. The gradient only affects first-order behavior of \(df(x)\), that is as \(dx\) goes to zero, thus if one expands \(df(x)\) in terms of \(dx\), it’s enough to write down only the linear term to find the differential. For example, for smooth scalar-valued functions we have \(df(x) = \langle \nabla f(x), dx \rangle\), so the gradient \(\nabla f(x)\) defines a linear coefficient for the differential \(df(x)\) ^{1}.

### Calculus

Okay, how can we derive the gradient of a function? One way is to take a derivative with respect to each scalar input variable, and then compose a vector out of it. This approach is quite inefficient and messy: you might have to recompute the same vector expressions for each input variable, or compute lots of sums (which does not leverage benefits of vector operations), or try to compose vector operations out of them. Instead we’ll develop a formal method that will allow us to derive gradients for many functions (just like differentiation rules you learned in the introduction to calculus) without leaving the realm of vector algebra.

Recall that the differential \(df(x)\) of a scalar-valued function \(f\) is a linear function of \(dx\) and the gradient. That means, that if we could write a differential \(df(x)\) and then simplify it to \(g(x)^T dx + O(\|dx\|^2)\) (we ignore higher-order terms, as they are not linear in \(dx\), and go to zero faster than \(dx\) does), we’ll recover the gradient \(\nabla f(x) = g(x)\). This is exactly what we’re going to do: develop a set of formal rules that will allow us to compute differentials of various operations and their combinations.

The general idea is to consider \(f(x + \Delta)\), and manipulate it into something of the form \(f(x) + L_x(\Delta) + O(\|\Delta\|^2)\) where \(L_x(\Delta)\) is a function of \(x\) (maybe constant, though) and \(\Delta\) that is linear in \(\Delta\) (but not necessarily in \(x\)). Then \(L_x(dx)\) is exactly the differential \(df(x)\).

Let’s consider an example. Let \(f(X) = A X^{-1} + B\) (all variables are square matrices of the same size):

\[ \begin{align*} f(X + \Delta) &= A (X + \Delta)^{-1} + B = A X^{-1} (I + \Delta X^{-1})^{-1} + B \\ &= A X^{-1} \left(\sum_{k=0}^\infty (-\Delta X^{-1})^k \right) + B = A X^{-1} \left(I - \Delta X^{-1} + O(\|\Delta\|^2) \right) + B \\ &= \underbrace{A X^{-1} + B}_{f(X)} \underbrace{-A X^{-1} \Delta X^{-1}}_{\text{linear in }\Delta} + O(\|\Delta\|^2) \end{align*} \]

Hence \(df(X) = -A X^{-1} dX X^{-1}\) (it’s not of the form \(\text{Tr}(g(X)^T dX)\) because \(f\) is not scalar-valued, so you can’t extract the gradient from it, because the “gradient” would be something like a 4-dimensional tensor). This way we can derive differentials for many common functions and operations:

\[ d(\alpha X) = \alpha dX \\ d(X + Y) = dX + dY \\ d(XY) = dX Y + X dY \\ d(X^{-1}) = -X^{-1} dX X^{-1} \\ d(c^T x) = c^T dx \\ d(x^T A x) = x^T (A + A^T) dx \\ d(\text{Tr}(X)) = \text{Tr}(dX) \\ d(\text{det}(X)) = \text{det}(X) \text{Tr}(X^{-1} dX) \]

It’s also very helpful to derive a rule to deal with function composition: suppose we have \(f(x)\) and \(g(x)\) with corresponding differentials \(df(x)\) and \(dg(x)\). Then the differential of \(h(x) = f(g(x))\) is

\[ dh(x) = f(g(x+dx)) - f(g(x)) = f(g(x) + dg(x)) - f(g(x)) = df(y)|_{dy = dg(x), y = g(x)} \]

That is, we take \(df(y)\), and replace each \(dy\) with \(dg(x)\), and each \(y\) with \(g(x)\).

These rules allow us to differentiate fairly complicated expressions like \(f(x) = \text{det}(X + B) \log(a^T X^{-1} a) - \text{Tr}(X)\)

\[ \begin{align*} df(X) &= d(\text{det}(X + B) \log (a^T X^{-1} a)) - d(\text{Tr}(X)) \\ &= d(\text{det}(X + B)) \log (a^T X^{-1} a) + \text{det}(X + B) d(\log (a^T X^{-1} a)) - \text{Tr}(dX) \\ &= \text{det}(X + B) \text{Tr}((X+B)^{-1} dX) \log (a^T X^{-1} a) + \frac{\text{det}(X + B)}{a^T X^{-1} a} d(a^T X^{-1} a) - \text{Tr}(dX) \\ &= \text{Tr}\left[\text{det}(X + B)\log (a^T X^{-1} a) (X+B)^{-1} dX \right] - \frac{\text{det}(X + B)}{a^T X^{-1} a} \left(a^T X^{-1} dX X^{-1} a\right) - \text{Tr}(dX) \\ &= \text{Tr}\left[\left(\text{det}(X + B)\log (a^T X^{-1} a) (X+B)^{-1} - \frac{\text{det}(X + B)}{a^T X^{-1} a} X^{-1} a a^T X^{-1} - I\right) dX \right] \\ \end{align*} \]

One way to sanity-check (not complete, though!) our derivations is to consider \(1 \times 1\) matrices, that is, scalar case. In scalar case it all boils down to \(f(x) = (x+b) \log(a^2 / x) - x\) with derivative \(f'(x) = \log(a^2 / x) - \frac{x+b}{x}-1\), which coincides with the formula above for \(1 \times 1\) matrices.

### The Hessian

The same idea can be used to calculate the hessian of a function, that is, a coefficient describing function’s local quadratic behavior. We restrict ourselves with scalar-valued functions of finite-dimensional vectors, but it generalizes to other functions if you consider appropriate bilinear maps.

We define the second order differential recursively \(d^2 f(x) = d(df(x))\) as a linearization of a linearization (and we need to go deeper!). One might note that linearization of a linear function does not make any difference, but we’re actually linearizing not the linear approximation, but the map \(x \mapsto df(x)\) itself. In a way, you can think of \(df(x)\) as a function of 2 arguments: the point \(x\) and an infinitesimal perturbation \(dx\). And we’re linearizing with respect to the first one. Since we have 2 independent linearizations, it’s incorrect to use the same perturbation to both of them, so we’ll introduce \(dx_1\) and \(dx_2\) as first and second order perturbations.

If \(df^2(x)\) at a given point \(x\) is a linearization of a linearization, it’s a function of 2 perturbations: \(dx_1\) and \(dx_2\). Moreover, it’s linear in both of them, so \(df^2(x)\) is actually a bilinear map. In case of a finite-dimensional vector space a bilinear map can be represented using a matrix \(H(x)\), that is \(d^2f(x) = dx_1^T H(x) dx_2\). The matrix \(H(x)\) is called the **hessian** and denoted \(\nabla^2 f(x)\).

Then one uses the same formal rules, expanding \(d^2 f(x) = d(df(x))\) by first computing \(df(x)\) w.r.t. \(dx_1\), and then differentiating the resultant expression w.r.t. \(dx_2\). Again, let’s consider an example \(f(x) = \text{det}(I + x x^T)\)

\[ df(x) = 2\text{det}(I + x x^T) x^T (I + x x^T)^{-1} dx_1 = 2 f(x) x^T (I + x x^T)^{-1} dx_1 \]

Now, keeping in mind that we can move scalars around (as well as transpose them), we get

\[ \begin{align*} d^2f(x) &= d(df(x)) \\ &= 2 \overbrace{df(x)}^{=dx_2^T \nabla f(x)} x^T (I + x x^T)^{-1} dx_1 + 2 f(x) d(x^T) (I + x x^T)^{-1} dx_1 + 2 f(x) x^T d((I + x x^T)^{-1}) dx_1 \\ &= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\ &\quad- 2 f(x) x^T (I + x x^T)^{-1} (dx_2 x^T + x dx_2^T) (I + x x^T)^{-1} dx_1 \\ &= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\ &\quad- 2 f(x) \overbrace{x^T (I + x x^T)^{-1} dx_2}^{\text{scalar, transpose}} x^T (I + x x^T)^{-1} dx_1 - 2 \overbrace{f(x) x^T (I + x x^T)^{-1} x}^{\text{scalar, trace}} dx_2^T (I + x x^T)^{-1} dx_1 \\ &= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\ &\quad- 2 dx_2^T f(x) (I + x x^T)^{-1} x x^T (I + x x^T)^{-1} dx_1 - 2 dx_2^T f(x) \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] (I + x x^T)^{-1} dx_1 \\ &= 2 dx_2^T \Bigl( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \\ &\quad -f(x) (I + x x^T)^{-1} x x^T (I + x x^T)^{-1} -f(x) \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] (I + x x^T)^{-1} \Bigr) dx_1 \\ \end{align*} \]

Thus the Hessian is

\[ \begin{align*} \nabla^2 f(x) &= 2 (I + x x^T)^{-1} x \left(\nabla f(x)^T - f(x) x^T (I + x x^T)^{-1} \right) + 2 f(x) \left(1 - \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] \right) (I + x x^T)^{-1} \\ &= (I + x x^T)^{-1} x \nabla f(x)^T + \left(2 f(x) - \nabla f(x)^T x \right) (I + x x^T)^{-1} \\ &= (I + x x^T)^{-1} x \nabla f(x)^T - \nabla f(x)^T x (I + x x^T)^{-1} + 2 f(x) (I + x x^T)^{-1} \\ &= 2f(x) \left((2 - x^T (I + x x^T)^{-1} x) I - (I + x x^T)^{-1}\right) (I + x x^T)^{-1} \\ \end{align*} \]

The funny thing is, \(f(x) = \text{det}(I + x x^T)\) can be simplified using the determinant lemma as \(f(x) = 1 + x^T x\). Now this is a very simple function, whose gradient is just \(2x\) and the Hessian is constant \(2I\). And actually, the expression above can be simplified into \(2I\). Sometimes it’s beneficial to simplify the function first (-:

### Conclusion

In this post I showed how one can derive gradients and hessians using formal algebraic manipulations with differentials. The same technique is, of course, applicable to infinite-dimensional spaces (calculus of variations) and vector-valued functions (where linear map is described using the Jacobi matrix).

### Addendum on the Dual Numbers

The set of formal rules described above is not only helpful when calculating gradients by hand, but can also be used to automatically differentiate a function as you evaluate it. Indeed, suppose you need to differentiate some big and complex function \(f(x)\). In the above I shoved how one can use formal rules to compute \(d f(x)\), rearrange the result into the form of \(\langle g(x), dx\rangle\), and use \(g(x)\) as the gradient. Note that if we use Taylor expansion of \(f(x+dx)\) at \(x\) we get \(f(x+dx) = f(x) + \langle \nabla f(x), dx \rangle + O(\|dx\|^2)\) and neither \(f(x)\) nor \(\nabla f(x)\) contain (or depend on) \(dx\). This means that if we extend our set of numbers by a symbol \(dx\) with a property \(\|dx\|^2 = 0\) (much like we have obtained complex numbers by adding a symbol \(i\) to the real numbers with a special property \(i^2 = -1\)), and evaluate \(f(x+dx)\) in this expanded algebra, we will obtain an expression of the form \(a + \langle b, dx \rangle\) with \(a = f(x)\) and \(b = \nabla f(x)\). And this is a well-known extension of the real numbers called dual numbers.

If you’re wondering about the particular way of combining the gradient \(\nabla f(x)\) and \(dx\), here’s the explanation. Recall, that the first-order term is the linearization of a function, that is, it’s linear transformation \(L_x\) of \(dx\). Because of the Riesz representation theorem this linear transformation \(L_x\) can be represented as a scalar product with some element of \(\mathcal{X}\) (the gradient, in our case): \(L_x(dx) = \langle \nabla f(x), dx \rangle\). Of course, this logic generalizes to non-scalar-valued functions (like \(f : \mathbb{R}^n \to \mathbb{R}^m\)): the gradient is used to define a linear map.↩