Contrary to optimistic claims in the ML literature, I often cannot help but think that deep neural nets are in fact overfit and do not generalize well. That claim, of course, hinges on what one means by generalizing well. About this there has been considerable confusion in the more practical, engineering-oriented ML literature, which at first glance seems to employ the method of empirical risk minimization (ERM) to select the optimal model. But ERM is only justified as an inductive principle if the necessary and sufficient conditions for generalization hold. These conditions are stated by statistical learning theory (SLT) and are unfortunately hard to establish for current deep learning methods. The engineering approach therefore falls back on a more direct route: it estimates the generalization error from the data itself, typically on a held-out test set. Without further assumptions about the error's distribution, however, one is not safe from the obvious inductive counterexamples: black swans, large deviations and the like. The data are finite, while the true generalization error is an expectation over infinitely many possible inputs. It is therefore always possible that the engineering estimate of the generalization error is off by a wide margin. There is no inductive guarantee. That is the price of leaving the regulated world of SLT.
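To make the contrast concrete, here is a minimal sketch of the engineering procedure under toy assumptions of my own invention (the one-dimensional data, the rare "black swan" region, and the helper names `sample` and `empirical_risk` are all hypothetical): a threshold classifier is chosen by ERM on a training sample, and its generalization error is then estimated directly from a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: the label is the sign of x, except in a rare "black swan"
# region (|x| > 3, probability ~0.3%) where the label is flipped.
def sample(n):
    x = rng.normal(size=n)
    y = np.sign(x)
    y[np.abs(x) > 3] *= -1          # rare region a finite sample may miss
    return x, y

def empirical_risk(theta, x, y):
    """0-1 loss of the threshold classifier sign(x - theta)."""
    return np.mean(np.sign(x - theta) != y)

# Engineering-style procedure: ERM on a training set, then a direct
# estimate of the generalization error from a finite held-out test set.
x_train, y_train = sample(1_000)
x_test, y_test = sample(1_000)

thetas = np.linspace(-2, 2, 401)
risks = [empirical_risk(t, x_train, y_train) for t in thetas]
theta_hat = thetas[int(np.argmin(risks))]     # ERM over the hypothesis class

test_estimate = empirical_risk(theta_hat, x_test, y_test)

# A much larger sample stands in for the "true" risk; the finite test
# estimate can underrepresent or entirely miss the rare region.
x_big, y_big = sample(1_000_000)
true_risk = empirical_risk(theta_hat, x_big, y_big)

print(f"held-out estimate of generalization error: {test_estimate:.4f}")
print(f"approximate true risk:                     {true_risk:.4f}")
```

Nothing in this procedure rules out the estimate and the true risk drifting apart; whether they do depends on the distribution, which is exactly the assumption the engineer leaves implicit.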
The confusion between engineering-style generalization and SLT-generalization is best seen in a widely discussed article by Zhang et al. In this article they claim that deep neural networks shatter the training data (i.e. they can realize all $2^n$ labelings of $n$ training points, fitting even randomly assigned labels with zero empirical risk) and nevertheless generalize well. But from SLT we know that ERM is consistent independently of the distribution iff $\lim_{n\to\infty} G^{\Lambda}(n)/n = 0$, where $G^{\Lambda}(n)$ is the growth function of the hypothesis class $\Lambda$. This is equivalent to the (uniform) generalization gap tending to zero. But if deep hypothesis spaces shatter training sets of every size, then $G^{\Lambda}(n) = n\ln 2$, so that $G^{\Lambda}(n)/n \to \ln 2 \neq 0$, rendering ERM inconsistent. In other words, the prediction from SLT is that hypothesis classes which shatter the training data do not generalize well.
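For reference, the relevant SLT statements can be summarized in Vapnik's notation as follows (a restatement of standard results, not an argument found in Zhang et al.):

$$
\begin{aligned}
&\text{Shattering: for every } n \text{ there exist } x_1,\dots,x_n \text{ on which } \Lambda \text{ realizes all } 2^n \text{ labelings},\\
&\text{Growth function: } G^{\Lambda}(n) = \ln \max_{x_1,\dots,x_n} N^{\Lambda}(x_1,\dots,x_n),\\
&\text{Distribution-free consistency of ERM} \iff \lim_{n\to\infty} \frac{G^{\Lambda}(n)}{n} = 0,\\
&\text{Shattering for all } n \;\Rightarrow\; G^{\Lambda}(n) = n\ln 2 \;\Rightarrow\; \frac{G^{\Lambda}(n)}{n} \to \ln 2 \neq 0.
\end{aligned}
$$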
What went wrong?
There are three possibilities:
- Deep hypothesis classes do not shatter the training data.
- Deep neural networks do not generalize well.
- Zhang et al. equivocate on the concept of generalization.
Concerning possibility 1, it should be noted that Zhang et al. do not formally prove the claim that deep hypothesis classes shatter the training data; they provide numerical evidence for it. The question then becomes: how should we weigh this numerical evidence? To answer it one needs to consider how well the separation between training and test set was upheld during model choice. If test data leaked into model selection, generalization might turn out to be an illusion.
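The kind of numerical evidence at issue can be reproduced in miniature. The sketch below is my own illustration, not Zhang et al.'s code: an assumed toy setup (Gaussian inputs, random binary labels, scikit-learn's `MLPClassifier` as the overparameterized model) in which the classifier is fit to randomly labeled data while a held-out test set stays outside the fitting and model-choice loop.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy data: the inputs carry no usable signal for the random labels.
n_train, n_test, d = 200, 200, 20
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_random = rng.integers(0, 2, size=n_train)   # training labels assigned at random
y_test = rng.integers(0, 2, size=n_test)

# Heavily overparameterized relative to the 200 training points, no penalty.
clf = MLPClassifier(hidden_layer_sizes=(1024,), alpha=0.0,
                    solver="lbfgs", max_iter=5000, random_state=0)
clf.fit(X_train, y_random)

train_err = 1.0 - clf.score(X_train, y_random)
test_err = 1.0 - clf.score(X_test, y_test)

# A (near-)zero training error on random labels is the numerical stand-in
# for "shattering"; a test error around 0.5 is what chance-level
# generalization looks like when there is nothing to learn.
print(f"training error on random labels: {train_err:.3f}")
print(f"held-out test error:             {test_err:.3f}")
```

Whether the same behaviour carries over to the scale of Zhang et al.'s experiments is precisely the question of how much weight such numerical evidence can bear.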
Regarding possibility 2, one has to take a closer look at what SLT actually says. SLT gives asymptotic error guarantees only for consistent learning procedures. But Zhang et al.'s procedure is, by their own argument, inconsistent. In that case no formal guarantee for the generalization of the model can be made; indeed, under the full distribution one should expect outliers for which the estimate lies arbitrarily far from the true risk. On this reading, generalization is an illusion.
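To fix the terms, consistency of ERM for a class $\Lambda$ and distribution $P$ is standardly defined as the double convergence

$$
R(f_n) \xrightarrow{\;P\;} \inf_{f\in\Lambda} R(f)
\qquad\text{and}\qquad
R_{\mathrm{emp}}(f_n) \xrightarrow{\;P\;} \inf_{f\in\Lambda} R(f)
\qquad (n\to\infty),
$$

where $R(f)$ is the true risk under $P$, $R_{\mathrm{emp}}(f)$ its empirical counterpart on $n$ samples, and $f_n$ the ERM solution. Only when this holds do the finite-sample bounds of SLT attach any guarantee to the empirical estimate; for a class of infinite VC dimension no such distribution-free guarantee exists.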
To spell out possibility 3, the argument in Zhang et al. must be made explicit in the language of SLT. Generalizing well in SLT means that a learning algorithm is consistent with respect to some hypothesis class and probability distribution. For Zhang et al., by contrast, generalizing well means that the trained model attains a small gap between training and test error on the true labels, even though the same architecture fits randomized labels; generalization, on this view, is what distinguishes true from randomized labels. This is equivalent to SLT-generalization only under specific assumptions. Thus generalization might be an illusion.
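The equivocation becomes visible when the two notions are written side by side (my own formalization of the contrast, not one found in Zhang et al.):

$$
\begin{aligned}
\text{SLT-generalization:}\quad & \sup_{f\in\Lambda}\,\bigl|R(f)-R_{\mathrm{emp}}(f)\bigr| \xrightarrow{\;P\;} 0 \quad (n\to\infty),\\
\text{engineering-style generalization:}\quad & \widehat{R}_{\mathrm{test}}(\hat f\,) - R_{\mathrm{emp}}(\hat f\,) \text{ is small for the one } \hat f \text{ actually returned}.
\end{aligned}
$$

The first is a uniform statement over the whole class; the second concerns a single data-dependent function evaluated on a finite test sample. Only under additional assumptions does the second imply anything like the first.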
The upshot of this discussion is that it presents us with a choice between two concepts of generalization: engineering-style generalization, for which no inductive guarantees can be given but which is feasible, and SLT-generalization, which comes with nice inductive guarantees but is infeasible. Under the strict normative requirements of SLT, engineering-style generalization looks like an illusion. But that will not bother the engineer, for whom feasibility trumps infinity.