Jumbly Grindrod has a recent paper arguing that LLMs produce meaningful output. He makes an argument drawing on teleosemantic theories that is very similar to the argument advanced by Lyre. He differs from Lyre in taking the success of LLMs to also be evidence for the distributional hypothesis, something Lyre does not discuss at all. In my post engaging with Lyre I have already mentioned that I'm skeptical about the application of teleosemantics to LLMs. As these arguments seem to be gaining some traction, maybe it is time to explain in more detail why I remain unconvinced by them.
But first I want to discuss whether the distributional hypothesis can be used to salvage attempts to ascribe meaning to LLM outputs.
> Distributional semantics then predicts that the proximity between two words will correspond with their meaning similarity.

(This and all the following quotes are from Jumbly's paper.)
The measure of nearness differs between word2vec and transformers. In NLP tasks that use word2vec, nearness is usually calculated via the Euclidean scalar product between two vectors (see for example doi:10.1017/S1351324916000334). Note that this measure has to be put in by hand; you could just as well try a different one, which I'm sure someone has already done.
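As a minimal illustration of what such a hand-picked measure looks like in practice, here is a sketch with invented four-dimensional vectors standing in for word2vec embeddings; it shows both the plain scalar product and a different common hand-picked choice, the length-normalized variant (cosine similarity).

```python
import numpy as np

# Invented 4-dimensional vectors standing in for word2vec embeddings
# (real word2vec vectors typically have 100-300 dimensions).
embeddings = {
    "cat":   np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.4, 0.1]),
    "piano": np.array([0.1, 0.9, 0.0, 0.7]),
}

def scalar_product_nearness(u, v):
    # The Euclidean scalar product: one hand-picked measure of nearness.
    return float(np.dot(u, v))

def cosine_nearness(u, v):
    # A different hand-picked choice: the scalar product normalized by length.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(scalar_product_nearness(embeddings["cat"], embeddings["dog"]))    # 0.86
print(scalar_product_nearness(embeddings["cat"], embeddings["piano"]))  # 0.18
print(cosine_nearness(embeddings["cat"], embeddings["dog"]))            # ~0.98
```

The numbers themselves mean nothing here; the point is that the choice of measure is ours, not the model's.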
Obviously, for simple n-grams you can also define a measure of nearness, as n-grams are represented as vectors[^1] anyway. You could even use the same Euclidean metric as in word2vec.
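Here is a correspondingly minimal sketch for bigrams, reusing exactly the same scalar product on successor-count vectors; the toy corpus is of course made up.

```python
import numpy as np

# A made-up toy corpus; any real corpus would do.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

# Represent each word by its vector of bigram (successor) counts.
counts = np.zeros((len(vocab), len(vocab)))
for w1, w2 in zip(corpus, corpus[1:]):
    counts[index[w1], index[w2]] += 1

def scalar_product_nearness(w1, w2):
    # The same Euclidean scalar product, now on bigram count vectors.
    return float(np.dot(counts[index[w1]], counts[index[w2]]))

print(scalar_product_nearness("cat", "dog"))  # 1.0 -- both are followed by "sat"
print(scalar_product_nearness("cat", "on"))   # 0.0 -- no shared successors
```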
In a transformer the notion of nearness is implicitly given by the learned kernel smoothing over the context window (the attention mechanism). But prima facie it is unclear whether that notion of nearness has anything to do with semantic similarity. How would you find out? Mere success (whatever that means) of a chatbot seems insufficient.
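To make the kernel-smoothing reading concrete, here is a minimal single-head sketch (no masking, no positional encodings): the softmax over scaled dot products acts as a data-dependent kernel that weights the value vectors across the context window. The projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # embedding dimension
context = rng.normal(size=(5, d))      # five token embeddings: a stand-in context window

# Random stand-ins for the learned projection matrices W_Q, W_K, W_V.
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = context @ W_Q, context @ W_K, context @ W_V

# Softmax over scaled dot products: a learned, data-dependent kernel
# over the positions of the context window.
scores = Q @ K.T / np.sqrt(d)
kernel = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each output vector is a kernel-smoothed average of the value vectors.
output = kernel @ V

print(kernel.round(2))  # each row sums to 1: a weighting over context positions
```

Nothing in this construction guarantees that the induced weighting tracks semantic similarity; whatever nearness the model implements is whatever the training objective settled on.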
Success, however, is exactly what Jumbly appeals to:

> However, this argument ignores the key insight behind the vector space approach discussed earlier—that those embeddings seem to serve as good representations of word meanings across a vast range of NLP tasks related to meaning, and that this in turn serves as a partial vindication of the distributional hypothesis.
This argument says no more than: because the outputs seem okay to us, they must have meaning. The problem is that, whichever notion of nearness a model has learned, we know it diverges infinitely often from our intended notion of nearness. These are the strange errors. It is not at all clear to me how distributional semantics copes with that problem. Clearly, it would help if we could say that the probability distribution[^2] learned by the LLM approximates the probability distribution of our language:
> If an LLM is able to access facts about the meanings of particular expressions insofar as the meanings of expressions impact upon how they are distributed across a corpus, then the probability distributions for each expression will reflect that.
This statement should be made a bit more precise. The first claim is that the meaning of an expression at least partly determines the probability distribution (of words in any text, i.e. the probability distribution that generates human language). This is the distributional hypothesis. I don't think it makes sense to speak of probability distributions for each expression: there is one distribution, which can take different values for different expressions. If there are different probability distributions, then we leave the IID setting of traditional language modelling. Note that any LLM only has access to the frequency distribution of the corpus it is trained on. Any reconstruction of a probability distribution is a statistical inference (this inference is the result of training the LLM). So what is argued here is that:
- The frequency distribution of the training corpus reflects the meaning of the expressions; and
- An LLM can infer a good approximation of the probability distribution which generates this frequency distribution (see the sketch after this list).
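To make the second claim concrete, here is a toy sketch of the inference step: the model only ever sees the frequency (empirical) distribution of a finite corpus, and reconstructing the generating distribution from it is an estimate, here the maximum-likelihood estimate of a bigram model. The "true" distribution is made up for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up "true" generating distribution over a three-word vocabulary:
# row i gives P(next word | current word i).
true_p = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8],
                   [0.4, 0.4, 0.2]])

# The training corpus: a finite sample drawn from the true distribution.
corpus = [0]
for _ in range(200):
    corpus.append(rng.choice(3, p=true_p[corpus[-1]]))

# All the model ever sees: the frequency distribution of that corpus.
counts = np.zeros((3, 3))
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1, w2] += 1

# Reconstructing the generating distribution is a statistical inference;
# here the maximum-likelihood estimate (an LLM does something far less transparent).
row_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # guard unvisited states
estimated_p = counts / row_totals

print(np.abs(estimated_p - true_p).max())  # nonzero: the estimate is not the true distribution
```

In this toy case the estimate at least converges to the true distribution as the corpus grows, because the model class matches the generating process by construction; for an LLM trained on human text, no comparable guarantee is available.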
In many LLM evaluations, goodness of approximation is not assessed in any formal sense. What people do when evaluating the performance of LLMs is heuristically explore a small part of the sample space. There are theoretical considerations about the impossibility of robust classifiers that should caution us against generalizing from heuristic performance evaluations. Evidence from controlled experiments also suggests that the sample space generated automatically during training is often not similar to the intended sample space of the problem[^3]. Such considerations make me very skeptical about claims like claim 2.
If you have followed my skeptical line of thought so far, there is an alluring symmetry: once you claim that LLMs can produce meaningful output, you are committed to claiming that other language models (e.g. n-grams) can produce meaningful output too. Jumbly wants to resist that conclusion:
> But what distinguishes LLMs from, for example, a simple bi-gram model that produces the most likely word given the word that has come immediately prior? The thought here is that the LLM, through its pre-training, has been able to analyse the factors that lead to the word being reproduced as these factors serve as latent variables in the statistical analysis of the word’s distribution.
And he goes on to say that:
> […] the bi-gram model is only able to perform a much more shallow distributional analysis that will only capture quite rudimentary facts regarding why a particular word has been reproduced.
So what this boils down to, in my preferred terminology, is that LLMs are hidden Markov models, as opposed to n-grams, which are Markov models. And supposedly the difference in meaning production lies in the claim that the latent space represents meaning. But to substantiate that claim it is not sufficient to point at observable outputs. Unfortunately, due to the training process, we don't have access to any explicit representation of the hidden states and their connections to observable states: we don't know the graph of the hidden Markov model. And as in the case of word2vec, we can look at parts of the latent space (the embeddings), but we don't have any meaningful vector algebra there. We know that the construction of the latent space depends on the frequency distribution of tokens in the training corpus, but that is about it. We don't know whether the latent states represent the meanings we intend them to represent. And from latent-space reconstruction experiments like the one mentioned above, we should rather conclude that they do not represent any meanings.
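To spell out that terminological point, here is a toy contrast between the two model classes, with invented states and probabilities; it illustrates only the structural difference, not how a transformer is actually parameterized.

```python
import numpy as np

rng = np.random.default_rng(2)
words = ["bank", "river", "money"]

# Markov (bigram) model: the next word depends only on the previous observable word.
bigram = {"bank": [0.2, 0.4, 0.4], "river": [0.7, 0.1, 0.2], "money": [0.6, 0.1, 0.3]}

def sample_bigram(n, word="bank"):
    out = [word]
    for _ in range(n):
        out.append(words[rng.choice(3, p=bigram[out[-1]])])
    return out

# Hidden Markov model: the next word depends on a latent state we never observe.
transition = np.array([[0.8, 0.2],              # P(next latent state | latent state)
                       [0.3, 0.7]])
emission = np.array([[0.5, 0.0, 0.5],           # P(word | latent state)
                     [0.5, 0.5, 0.0]])

def sample_hmm(n, state=0):
    out = []
    for _ in range(n):
        out.append(words[rng.choice(3, p=emission[state])])
        state = rng.choice(2, p=transition[state])   # the state path stays hidden
    return out

print(sample_bigram(5))
print(sample_hmm(5))
```

The output of sample_hmm looks just like the output of sample_bigram; whether the latent states behind it represent anything is exactly what cannot be read off the samples.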
[^1]: A word of caution about vector spaces: neither the vectors from word2vec, nor those from n-grams, nor the embeddings learned by LLMs form a true vector space in the mathematical sense! Meaning is not preserved under the usual vector space operations like addition and scalar multiplication; often these operations make no semantic sense at all. See also this discussion on stackexchange.
[^2]: Actually, we should rather speak of the probability model, because what is learned is not only the distribution but also the sample space and the random variables over that sample space.
[^3]: A similar effect can be observed in the Kolmogorov construction of stochastic processes: it does not give a unique sample space.