The more I talk to people online, the more I hear about the famous No Free Lunch Theorem (NFL theorem). Unfortunately, quite often people don’t really understand what the theorem is about, and what its implications are. In this post I’d like to share my view on the NFL theorem, and some other impossibility results.

### No Free Lunch Theorem Revisited

First, let’s formally state the NFL theorem. I’ll take theorem statement from the (freely available!) book *Understanding Machine Learning: From Theory to Algorithms* by Shai Shalev-Shwartz and Shai Ben-David.

In the nutshell, the theorem says that whatever learning algorithm you pick, there will always be a problem (=dataset + some metrics), that your particular algorithm is incapable of solving, even though in principle the problem could be solved (by some other algorithm, which would have its own kryptonite problem). More formally (I modified the statement to distill the notation):

Let \(A\) be any learning algorithm for the task of binary classification with respect to the 0−1 loss over a domain \(\mathcal{X}\) . Let \(m\) be any number smaller than \(|\mathcal{X}|/2\), representing a training set size. Then, there exists a distribution \(D\) over \(\mathcal{X} × \{0, 1\}\) such that:

- There exists a function \(f : \mathcal{X} \mapsto \{0, 1\}\) with \(\mathbb{P}(f(x) \not= y \mid (x, y) \sim D) = 0\).
- With probability of at least 1/7 over the choice of \(S \sim D^m\) we have that \(\mathbb{P}(A_S(x) \not= y \mid (x, y) \sim D) \ge 1/8\)

The idea of the proof is that if you have fixed training set and some nontrivial number of unseen examples, one can vary labels of these unseen examples arbitrarily. So, if your algorithm classifies some example correctly, there exists similar problem with the only difference being different ground truth label for this example. Essentially, for the same training set you can construct completely different test sets.

### Sounds pretty frustrating, isn’t it?

The result suggests impossibility of the universal learning machine that’d be able to take any training set, and make the best predictions possible for the unseen data. And this *is* impossible! Another reformulation of the same theorem says that every classification algorithm has accuracy 1/2 when averaged over all possible problems. However, practical implications of the theorem are not so far-reaching.

The theorem essentially says that every problem has an evil doppelgänger, that’d break your precious model you trained so long. However, how likely are you to run into this doppelgänger? How likely is it to run into the problem where the test set differs from the train set so much? Or, how can our human brains work so well ^{1}? Let me expand the later thought.

I believe our brains are not magical, they are just another kind of a (biological) learning machine, obeying same mathematical principles, powerful enough to solve various problems we face every day. Yes, we can’t solve all the problems in the world, but why would we care? In Machine Learning as a subfield of Artificial Intelligence we seek to solve problems of *practical importance*, and in the first place automate what people already can do. Thus, we have a proof that there exists an algorithm that works reasonably well. It’s right here, in your brain.

So how come we’re able to navigate in such a complex world, communicate in such complicated languages, and discover laws of nature through science by thinking hard enough, if for every problem *we* successfully solve the mathematics has an evil copy of? The answer seems to be that these evil copies are very rare. And I believe there’s a reason for that.

Let’s get back to the theorem. Recall, that is essentially based in the fact that for a fixed training set you can vary test set as you wish. How complicated (for some intuitive notion of complexity) does that make the distribution \(p(y|x)\) that makes perfect predictions for a given \(x\)? Well, if it had one regularity pattern in the training set, and then suddenly changed this pattern in the test set to something completely different, that’d make the target distribution \(p(y|x)\) more complicated. So, even if every good problem (i.e. one we, humans, can solve) has an evil twin, twin’s complexity should be higher due to way more complex regularity pattern.

Thus, I’m sure more complicated problems and objects are less likely in the Universe. Otherwise, we’d not be able to have such a complicated life with our particular instance of a learning machine, implemented in our brain. The NFL theorem states there are hard problems out there, but doesn’t say anything how common they are, implicitly assuming uniform distribution, which seems to disagree with our observations.

### Other impossibility results

Another similar result is the halting problem, which states that given any program, you’d not be able to determine if it stops with 100% accuracy. However, this does not mean that every program’s halting is undecidable. For example, for linear bounded automata one actually can decide if a program for this automation halts (thought that might require some astronomical amount of memory). This result only states there’s no universal decider, and every particular class should be inspected separately.

To recap, the idea of this post is that even though the theory seemingly limits our capabilities, we should not get discouraged by these results, as they are way more general than we need in practice. Quite often we can still solve real problems due to the fact that the general case includes some really weird functions, but reality does not.

Well, we don’t have other baselines, so it’s hard to tell if they indeed work well. But still, human brains are the best learning machines known to humanity.↩