Do LLMs really train themselves?

Recently Holger Lyre presented his paper “‘Understanding AI’: Semantic Grounding in Large Language Models” in our group seminar.

And while I generally remain skeptical about his claims of semantic grounding (maybe the occasion for a separate post), here I want to address a misunderstanding in his paper about what he calls “self-learning”, “self-supervised learning” or “self-training”. So what does Lyre mean by self-learning? He quotes Bengio et al., who say it is “a form of prediction or reconstruction […] which is training to ‘fill in the blanks’ by predicting masked or corrupted portions of the data.” (p.4) There is a bit of ambiguity here, because most LLMs – BERT is a notable exception – are trained on the next-token prediction task, which in statistical language means they are autoregressive models. And nobody in statistics would say that those models do anything themselves.
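To make the contrast concrete, here is a minimal toy sketch (my own illustration, not code from any actual LLM) of what the next-token prediction task amounts to: every prefix of a token sequence serves as input and the immediately following token as target, which is exactly the autoregressive setup statisticians would recognize.

```python
def next_token_pairs(tokens):
    """Yield (context, target) training pairs for next-token prediction:
    each prefix of the sequence predicts the token that follows it."""
    for i in range(1, len(tokens)):
        yield tokens[:i], tokens[i]

pairs = list(next_token_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# e.g. the pair (["the", "cat", "sat"], "on"): predict the next token from the prefix
```

Nothing in this setup is decided by the model; the objective is fixed before training begins.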

In any case, if we take Bengio et al. at face value, this means that GPTs wouldn’t qualify as self-learning – they are autoregressive, not masked. But probably we shouldn’t take them at face value and should cut them some slack. After all, their statement is from a popular article giving an overview of their collective research of the past decades. On the other hand, it could be argued that the masked language modelling task is a generalization of the next-token prediction task: who says that the blanks in your text have to be at the end? Be that as it may, even if we see self-learning as masked language modelling, for Lyre it has a more fundamental significance. He thinks it is through self-learning that LLMs get their semantic grounding. He says: “Overall, the non-trivial finding is that LLMs are able to extract extensive knowledge about causal and other regularities of the world from vast amounts of textual data through their self-training. In other words, LLMs develop world models by extracting world structure from training data. This gives the systems an indirect causal grounding.” (p.13)
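The generalization point can be made concrete with another toy sketch (again my own illustration; the function name and the "[MASK]" symbol are assumptions for the example, not anyone’s actual implementation): masked language modelling blanks out arbitrary positions, and next-token prediction drops out as the special case where the only blank is the final position.

```python
def fill_in_blanks_pairs(tokens, blank_positions):
    """Corrupt the chosen positions with a [MASK] symbol; the targets are
    the original tokens at those positions."""
    corrupted = [t if i not in blank_positions else "[MASK]"
                 for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in blank_positions}
    return corrupted, targets

toks = ["the", "cat", "sat", "on", "the", "mat"]
masked = fill_in_blanks_pairs(toks, {1, 4})              # blanks anywhere: masked LM
last_only = fill_in_blanks_pairs(toks, {len(toks) - 1})  # blank at the end only: next-token prediction
```

On this reading, “fill in the blanks” covers both training regimes, which is presumably the charitable way to read Bengio et al.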
Here I think he has been misled by semantically overloading the word “self” as used in “self-learning” by ML practitioners. They think self-learning is just masked language modelling, while Lyre takes a step further and claims that it is self-learning that extracts “causal and other regularities” from the training data. How does he arrive at such a strong claim? One hint is his contrasting of self-learning with other, supposedly non-self techniques. Comparing DeepBlue, a non-self model, to MuZero, he says that “[t]his success [of DeepBlue], however, was largely based on preimplemented heuristics and brute-force search in the combinatorial tree. In contrast, MuZero’s functional grounding is based on a generative model developed via self-learning, the essential feature of any true generative AI.” I take this to mean that self-learning does something different than the clever heuristics and approximations used in DeepBlue. But what? If we define self-learning as the type of learning that extracts causal correlations, we cannot then use it to claim indirect causal grounding – that would be circular. If we stay within the masked language modelling scheme, self-learning remains just another clever optimization technique. Unfortunately, Lyre’s paper doesn’t offer a compelling answer.

So why do I claim that Lyre has been misled by overloading the term “self”? In the discussion after his talk it became clear to me that he thought the transformer itself adjusted its learning goal according to a decision rule it gave itself – hence the term self-learning. He seemed to think that, depending on the success or failure of the predictions it made, it would change its decision rule and its weights to make better predictions. That this is not the case even for masked language modelling can readily be seen from the pseudo-code of Algorithm 12 or, if you prefer prose, from the usual descriptions of the training procedure: the mask is generated randomly, and its probability distribution is a hyperparameter of the algorithm.
It is therefore as much a pre-implemented heuristic as are those of DeepBlue.
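For illustration, here is a minimal sketch of how such a mask is typically generated (my own toy code, not the pseudo-code of Algorithm 12 or any library’s implementation): each position is masked independently with a fixed probability, and both that rule and the probability are chosen by the practitioner in advance. The model never selects or adjusts either of them.

```python
import random

def random_mask(length, p, rng):
    """Return the set of positions to mask. The masking rule and the
    probability p are fixed hyperparameters, set before training;
    nothing here is decided by the model itself."""
    return {i for i in range(length) if rng.random() < p}

rng = random.Random(0)
positions = random_mask(10, 0.15, rng)  # a typical masking probability; still just a preset number
```

The decision rule sits entirely outside the model, which is the sense in which it is a pre-implemented heuristic.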

One might object that the training process of LLMs contains additional steps beyond training the foundational model – for example, the fine-tuning by reinforcement learning from human feedback. Although I don’t believe anything amounting to a strong notion of self-learning happens there, I luckily don’t have to answer this objection. Lyre himself thinks that “[i]t is mainly the first step [the training of the foundational model], where LLMs acquire a crucial amount of world knowledge in terms of law-like regularities that the systems find and extract from the massive amounts of text data.” (p.14) And as I have already argued, no self-learning in this sense happens at the level of foundational models anyway.

We philosophers should be very careful about identifying our semantically thick concepts with the perhaps innocuous technical terms of the special sciences – especially with the very suggestive notions deployed in AI and ML. On a positive note, at least analytic philosophers will be delighted that there is lots of opportunity for conceptual clarification.
