I recently attended Tom’s closing workshop for his Philosophy of statistical learning theory project. It was a great workshop and I learned a great deal from the talks. Below I provide a streamlined version of the notes I took, for all those who were interested but couldn’t attend. The abstracts of the talks can be found here: https://www.mcmp.philosophie.uni-muenchen.de/events/workshops/container/ml_2023/index.html#schuster.
Reliable AI – Gitta Kutyniok
Gitta started her talk by quoting Rahimi’s famous remark that ML has become alchemy, noting that now – five years later – we still cannot make a principled choice of which ML method to pick. There has, she argued, been some progress though, mainly in the areas of privacy (differential privacy), security (cryptographic methods) and responsibility (explainability, fairness). In the area of safety, i.e. mathematical guarantees such as error bounds, there has been much less progress. After a quick artificial-neural-networks-for-beginners tutorial, she pointed out several open epistemic questions about ANNs. How does their architecture affect their learning performance? Why does stochastic gradient descent converge to a good minimum? Can we give overall success guarantees for generalization? These questions are all unanswered so far.
Her own work revolves around what she called “limitations of reliability”, and she presented two theorems she and her collaborators proved which exemplify such limitations. The first theorem roughly states that the solution of a finite-dimensional inverse problem (i.e. a learning problem) is not computable by a DNN on a certain type of digital machine model (think Turing machine). The second states that the solution of a finite-dimensional inverse problem is computable by a DNN on an analog machine (which basically means you harness the power of real numbers).
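For orientation, the setting I take to be at stake is roughly a finite-dimensional linear inverse problem; this gloss, including the notation below, is mine and not the exact statement of the theorems.

```latex
% Forward operator A, unknown signal x, noisy measurements y:
\[
  y = A x + e, \qquad A \in \mathbb{R}^{m \times n}, \quad \lVert e \rVert \le \eta .
\]
% The learning problem is to approximate the reconstruction map y -> x by a DNN.
% The two theorems then concern whether that map can be computed by a DNN on a
% digital machine model versus on an analog (real-number) machine.
```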
Gitta thinks that new modes of computing like neuromorphic computing might provide the conditions for the second theorem to apply.
Now what do we make of this? I think neither the limiting nor the anti-limiting result has much practical importance. Against the limiting result it might be said that computability perhaps isn’t a relevant property of interesting learning problems anyhow; maybe it is sufficient if they are semi-computable. Even if not, one could restrict the learning problem to computable functions – similar to how Solomonoff induction is restricted. I haven’t looked into the maths in detail, but I would guess that the breakdown of learning in the limiting theorem happens for reasons similar to those for which Solomonoff induction over all measures fails.
The anti-limiting theorem seems to say something interesting about the possibility of learning any function on an analog computer. But note that, in practice, this boils down to how good we are at preparing the initial state of the computation with arbitrary precision. There was a reason why analog computers were replaced by digital ones, and it had not a little to do with error control. So in the end, the anti-limiting theorem externalizes the computational problem into an engineering problem, and as of now there is no indication that we will ever solve it.
Reverse engineering the model – Jan-Willem Romeijn
Jan-Willem wants to make the centuries of work in mathematics and the philosophy of induction fruitful for ML. The central idea is that induction should rely on data alone, via some Carnapian prediction rule. One can then employ de Finetti-style representation theorems to figure out the prior probability over an underlying multinomial distribution: the prediction rule represents (hence “representation theorem”) that distribution. Transferring this framework to ML, the hope is that representation theorems will help uncover the inductive assumptions (in the form of priors) of ML prediction methods. The general gist is this: view an ML method as a predictive rule in Carnap’s spirit, prove a representation theorem about it, and use that theorem to make the method’s prior explicit.
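For concreteness, here is the classical picture behind this (standard textbook material rather than anything specific from the talk): Carnap’s λ-continuum as the prediction rule, and the de Finetti representation that links it to a Dirichlet prior over a multinomial.

```latex
% Carnap's lambda-continuum for k categories, after observing n_j instances of
% category j among n observations in total:
\[
  P(X_{n+1} = j \mid X_1, \dots, X_n) \;=\; \frac{n_j + \lambda / k}{n + \lambda} .
\]
% De Finetti: any exchangeable sequence is a mixture of i.i.d. multinomials,
\[
  P(X_1 = x_1, \dots, X_n = x_n) \;=\; \int_{\Delta_k} \prod_{i=1}^{n} \theta_{x_i} \, d\mu(\theta),
\]
% and Carnap's rule corresponds to the symmetric prior
% mu = Dirichlet(lambda/k, ..., lambda/k) -- exactly the kind of "inductive
% assumption" one would like to read off an ML method.
```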
Of course, most ML methods cannot be directly cast as Carnapian prediction rules (that would be too easy, right?). So JW proposes that we use time-honoured statistical methods to sample their outputs and thereby feed a statistical model. Everything hinges on this approximation step: we approximate the ML model with a statistical model of which we have a better grasp. Will this approximation be controllable in a fashion that gives us understanding? I don’t know, but I’m very curious to see a toy example worked out where we actually obtain the Carnapian prediction rule for some ML method.
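In that spirit, here is a minimal toy sketch of what the reverse-engineering step could look like. Everything in it, the black-box stand-in and the grid search over λ, is my own illustration and not Jan-Willem’s proposal; a real ML model would take the place of `black_box_predict`.

```python
import numpy as np

# Toy sketch: query a black-box predictor on many observed histories, then
# fit Carnap's lambda-continuum to its outputs to read off the prior it
# implicitly encodes.

rng = np.random.default_rng(0)
k = 3  # number of outcome categories

def black_box_predict(counts):
    # Stand-in for an opaque predictive method; here it is secretly
    # add-one (Laplace) smoothing, but we only get to query it.
    return (counts + 1.0) / (counts.sum() + k)

def carnap_rule(counts, lam):
    # Carnap's lambda-continuum: P(next = j) = (n_j + lam/k) / (n + lam).
    return (counts + lam / k) / (counts.sum() + lam)

# Query the black box on a spread of random observation histories.
histories = [rng.multinomial(n, rng.dirichlet(np.ones(k))) for n in range(1, 50)]
targets = np.array([black_box_predict(c) for c in histories])

# Find the lambda whose Carnapian predictions best match the black box.
lams = np.linspace(0.1, 10.0, 200)
fit = [np.mean((np.array([carnap_rule(c, lam) for c in histories]) - targets) ** 2)
       for lam in lams]
print("implied lambda:", lams[int(np.argmin(fit))])
# Prints roughly 3.0, i.e. a flat Dirichlet(1, 1, 1) prior, as expected for
# Laplace smoothing with k = 3 categories.
```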
The curve fitting problem revisited – Oliver Buchholz
“All the impressive achievements of deep learning amount to just curve fitting”, said Judea Pearl, and if that is not a reason to revisit curve fitting, then I don’t know what is.
Oliver proposed a puzzle that arises if we take this position seriously. There seems to be a general consensus in the community (an even rarer consensus between philosophers and scientists, it seems!) that overfit curves don’t generalize well. Neural nets are extremely overfit, and yet they generalize well. The puzzle follows immediately: if they are just curve fits, why the hell do they generalize that well? Oliver used this puzzle to delve into a discussion of simplicity in curve fitting, suggesting that this notion might work differently in ML contexts. I have an alternative take: why do we take it for granted that ML models, specifically neural nets, generalize well in the first place? For every model it seems easy to come up with methods that generate adversarial examples against it. Doesn’t this already mean that they don’t generalize well?
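As a toy illustration of the classical half of the puzzle (my own, not part of Oliver’s talk), here is the textbook picture of an overfit polynomial failing to generalize, done with plain numpy:

```python
import numpy as np

# Classical intuition the puzzle trades on: an overfit polynomial nails the
# noisy training data but generalizes badly, which is what makes heavily
# overparameterized nets look puzzling if they are "just curve fitting".

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(15)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 12):  # modest fit vs. heavily overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# Typically the high-degree fit drives the training error toward zero while
# the test error grows.
```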
Philosophical considerations on abstaining ML – Daniela Schuster
To be honest, this was the first time I had heard the term abstaining ML. Abstaining ML methods are decision methods that can abstain from making a decision. This can be desirable for several reasons, for example when the risks associated with misclassification are too high. In her talk Daniela provided a philosophical taxonomy of different methods of abstention. Her first distinction is between outlier and ambiguity abstention, outlier abstention being of a positive nature and ambiguity abstention of a privative nature: an outlier is something you can (positively) point at in your justification for abstaining, while an ambiguity cannot be pointed out so easily. The taxonomy has several further classes (and subclasses), like attached and merged (labelled and unlabelled) abstentions, which basically have to do with the amount of human involvement in abstaining. I guess this framework could be interesting for ethicists of XAI, if one were able to easily point out the amount of human involvement in the decision in question. But, alas, I think this is very hard to do, so I don’t know whether such a framework will ever be applicable.
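To make the distinction concrete, here is one possible toy operationalization of the two abstention types; the k-NN classifier, the thresholds and the whole setup are my own stand-ins, not Daniela’s framework.

```python
import numpy as np

# Outlier abstention: the input sits far from anything seen in training
# (something we can positively point at). Ambiguity abstention: the
# predictive distribution over classes is too flat to commit.

def classify_with_abstention(x, train_X, train_y, n_classes,
                             ambiguity_thresh=0.7, outlier_thresh=2.0, k=5):
    dists = np.linalg.norm(train_X - x, axis=1)
    if dists.min() > outlier_thresh:
        return "abstain (outlier)"        # we can point at the outlier
    votes = train_y[np.argsort(dists)[:k]]   # k-nearest-neighbour vote
    probs = np.bincount(votes, minlength=n_classes) / k
    if probs.max() < ambiguity_thresh:
        return "abstain (ambiguity)"      # no class is clearly supported
    return int(np.argmax(probs))

rng = np.random.default_rng(2)
train_X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
train_y = np.array([0] * 50 + [1] * 50)

print(classify_with_abstention(np.array([0.1, 0.2]), train_X, train_y, 2))    # clearly class 0
print(classify_with_abstention(np.array([1.5, 1.5]), train_X, train_y, 2))    # between the clusters
print(classify_with_abstention(np.array([10.0, 10.0]), train_X, train_y, 2))  # far from everything
```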
Philosophical aspects of unsupervised learning – David Watson
David pointed out that unsupervised learning hasn’t received the same philosophical attention as its supervised cousin. And I think he is right about that! He posed the problem of giving a definition of unsupervised learning without reference to supervised learning, because the definitions in the literature are only contrastive. Maybe our working definition is fine as is – I am not sure – but the latter part of his talk didn’t hinge on finding a correct definition; rather, it connected unsupervised learning with (ancient) problems of philosophy, the obvious ones being natural kinds and essential properties. For example, in the case of clustering methods (such as k-means) he suggested an ontological and an epistemic claim: the epistemic claim is that we learn natural kinds through clustering algorithms, while the ontological claim is that natural kinds actually are what clustering algorithms ought to find in ideal circumstances. This reminded me of a talk my friend Ryan gave ages ago. It is really unfortunate that nothing seems to have been done on these topics since.
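For reference, here is the standard k-means objective, which gives the ontological claim some concrete content (the formula is textbook material; its use here is my gloss, not David’s):

```latex
% k-means partitions the data x_1, ..., x_n into clusters C_1, ..., C_k so as to
% minimize the within-cluster sum of squares:
\[
  \min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
  \qquad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
\]
% On the ontological reading, natural kinds would be whatever such an objective
% carves out under ideal data and an ideal choice of k and metric.
```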
David concluded his talk by reminding everyone that even though human involvement might be lower in unsupervised learning, there is prima facie nothing objective about what is learned.
Reconsidering Randomization – Konstantin Genin
Konstantin somehow managed to smuggle a quite unrelated topic into this ML conference. But that is fine, we like RCTs. He was basically arguing along the lines of Deaton and Cartwright, attacking the notion of RCTs as the gold standard of clinical trials. The ethical costs of RCTs – namely, assigning trial participants to the placebo arm – must be balanced against the epistemic goods they provide. If there are epistemically better tools at hand, one is morally obliged to use them rather than RCTs. So what is the epistemic good RCTs provide, and are there alternatives? Konstantin of course thinks there are. If an unbiased estimator of the average treatment effect is the epistemic good RCTs provide, he argued, then RCTs are not superior to other experimental designs – some of which even allow more direct modes of intervening on causal variables. The only remaining question, then, is why RCTs are still considered the gold standard in clinical science. Is it because practitioners haven’t learned about novel results in experimental design theory?
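For reference, the textbook potential-outcomes argument for what randomization buys you epistemically goes as follows (my gloss of the standard story, not Konstantin’s slides):

```latex
% Potential outcomes Y(1), Y(0); treatment indicator T; observed outcome
% Y = T * Y(1) + (1 - T) * Y(0). The average treatment effect is
\[
  \mathrm{ATE} \;=\; \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr].
\]
% Under randomization T is independent of (Y(1), Y(0)), so
\[
  \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0]
  \;=\; \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] \;=\; \mathrm{ATE},
\]
% and the sample difference in means is an unbiased estimator of the ATE.
% The debate is whether other designs can secure this (or something stronger)
% at lower ethical cost.
```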
Beyond Generalization: A Theory of Robustness in Machine Learning – Timo Freiesleben
Timo bemoaned that the notion of robustness in ML is either woefully vague or very, very narrow. For example, ML models should be robust against label shift, against adversarial examples, and so on – where do we draw the line? His aim is to give ML researchers a framework in which they can discuss all of their notions of robustness more explicitly; the general aim of his proposal is clarification. So here is how he thinks we should talk about robustness: a target is robust with respect to interventions on a robustness modifier at a certain level of tolerance. Examples of robustness targets are predictions, explanations and deployment performance; robustness modifiers can be the choice of predictor, the performance metric, hyperparameters and much more. The level of tolerance is something we have to agree on beforehand. Timo hopes that this will settle the muddle of confusion that many current robustness debates seem to be stuck in.
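If I try to write the schema down semi-formally, it comes out something like the following; the notation, and the choice of a distance-and-threshold reading of tolerance, is my paraphrase rather than Timo’s own formalism:

```latex
% T(m): the value of the target (prediction, explanation, deployment
% performance, ...) when the robustness modifier is set to m; m_0 the
% reference setting; I a set of admissible interventions on the modifier;
% d a distance on the target space; epsilon the agreed level of tolerance.
\[
  T \text{ is robust w.r.t. } \mathcal{I} \text{ at tolerance } \varepsilon
  \quad\Longleftrightarrow\quad
  d\bigl(T(m_0),\, T(m)\bigr) \le \varepsilon \;\; \text{for all } m \in \mathcal{I}.
\]
```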