One Shot, Few Shot, Radical Shot

Exunoplura is back up after a sad excursion through the challenges of hosting providers. To be blunt, they mostly suck. Between systems that just don’t work right (SSL certificate provisioning in this case) and support experiences that range from bad to counterproductive, it’s enough to make one want to host the site oneself. But hosting is, as they say of war, mostly long stretches of boredom punctuated by moments of terror as things go frustratingly sideways. Still, we are back up again after two hosting-provider side-trips!

Honestly, I’d like to see an AI agent effectively navigate these technological challenges. Where even human performance is fleeting and imperfect, the notion that an AI could learn to deal with the uncertain corners of the process strikes me as currently unthinkable. But there are some interesting recent developments worth noting and discussing on the journey toward what is called “general AI”: a framework that is as flexible as people can be, rather than narrowly tied to a specific task like visually inspecting welds or answering a few questions about weather, music, and so forth.

First, there is the work by the OpenAI folks on massive language models tested against one-shot and few-shot learning problems. In these learning problems, the number of presentations of the training cases is limited, rather than presenting huge numbers of exemplars and “fine-tuning” the response of the model. What is a language model? Well, it varies across approaches, but it is typically a weighted model of word contexts of varying length, with the weights reflecting the probabilities of words appearing in those contexts across a massive collection of text corpora. For the OpenAI model, GPT-3, the total number of parameters (the learned weights of the network) is an astonishing 175 billion, trained on 45 TB of text.
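To make the “weighted context” idea concrete, here is a minimal, purely illustrative sketch in Python: a count-based bigram model over a toy corpus. GPT-3 itself is a neural transformer whose parameters are learned weights rather than raw counts, so treat this only as a cartoon of conditioning a word’s probability on its context; the corpus and function names are my own inventions.

```python
# Toy illustration of the "weighted context" idea behind language models.
# This is a count-based bigram model, not how GPT-3 works internally
# (GPT-3 is a neural transformer whose parameters are learned weights),
# but it shows probabilities of words conditioned on the preceding context.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each one-word context.
context_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    context_counts[prev][curr] += 1

def next_word_probability(context, word):
    """Estimate P(word | context) from raw counts."""
    counts = context_counts[context]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

print(next_word_probability("the", "cat"))  # 0.25: "the" is followed by "cat" once out of four
```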

Now, there is no reason we should expect that simply scaling such systems up to more and more parameters would translate into unique new capabilities, but one can always hope that quantitative size results in qualitative shifts. And it seems to do so here, where single presentations of sample language tasks to the model resulted in better performance than previous, smaller models achieved. For instance, given the instruction “Translate English to French” followed by one pair of example English-French phrases, the system then responds to additional prompts without any gradient updates, that is, without any changes to the weights of the model. This kind of one-shot or few-shot probing of the system extends from translation to question answering, from pronominal reference to reading comprehension, from SAT analogies to even multi-digit arithmetic, and so forth. In each case, the scale-up approach pays dividends in increasing the accuracy of the system. The performance approaches human capabilities in several areas, achieving near 100% on low-digit arithmetic problems and 65% on SAT analogies. For automatically generated news articles on a topic, humans were only able to distinguish GPT-3-generated articles from human-written ones with 52% accuracy.
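For a concrete sense of what prompting “without any gradient updates” looks like, here is a minimal sketch of assembling a few-shot prompt as plain text. The example pairs echo the translation illustration in the GPT-3 paper; the arrow separator and the helper function are my own assumptions, not a required GPT-3 format. The worked examples only condition the model’s next-word predictions; nothing in the model itself changes.

```python
# Assembling a few-shot prompt: a task description and a handful of worked
# examples are simply prepended to the new query, and the model is asked to
# continue the text. No weights are updated; the examples only condition
# the model's predictions. The "=>" format here is an illustrative choice.

def build_few_shot_prompt(task_description, examples, query):
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to complete this line
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
# Translate English to French:
# sea otter => loutre de mer
# peppermint => menthe poivrée
# cheese =>
```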

There are some notable limitations, though, like difficulty maintaining semantic coherence without repetition over long stretches of generated text. Further, bidirectional scanning is something we do when reading, looking back to integrate new material into our overall mental model, but GPT-3, which processes text strictly left to right, does not do this. Some of the tasks are clearly just pre-trained from the training data set (like language translation), so only a few of the tasks really appear to demonstrate a break from record-and-regurgitate, but the fact that the non-regurgitation tasks work so well is encouraging.

The final discussion of the paper on societal impacts, bias, and related topics is worth reading even if one skips the bulk of the technical work. This is the part that makes Elon Musk and others wary of AI’s impact on the future.

But there is also a discussion of meta-learning approaches that offer an alternative to the massive-language-model methodology. Instead of learning patterns from text at a scale no human has ever seen, in meta-learning the system is learning how to learn. There are also some suspicions derived from developmental psychology that can be brought to bear and, I will add shortly, even some insights from Anglo-American Analytical Philosophy, because there must be some value to that field, just as linguistics should have something to say about language modeling. I add the latter because when I was working on statistical methods for machine translation in the 90s, some of the computational linguists complained that language modeling told us absolutely nothing about language per se. It is a valid criticism, too. GPT-3 shows us that somehow a whole bunch of weighted example contexts can answer questions and complete prompts, but the closest it gets to actually informing us about language is at the limits, like long-range pronominal resolution and metonymy, where we see the approach break down and are forced to confront the fact that the model isn’t smart like us.

But one-shot and few-shot learning in the face of sparse data is, to begin with, one of the hallmarks of developmental psychology, where there are clear problems reconciling the rates at which human children learn words with the number of contextual exposures they receive. Even acknowledging problems with the original form of the so-called “poverty of the stimulus” argument, it is apparent that people do one-shot learning, and it can even be examined in terms of brain architecture. But what of some contribution from philosophy here? Let’s take a look at Quine’s classic paper on synonymy and the analytic-synthetic distinction, “Two Dogmas of Empiricism.” Here we see how mental models are guided by a pliable cohesion with boundary conditions that are facts about the world. David Lewis would later examine how certain concepts become referential piers that the ships and boats of ideas anchor to. The ideas themselves are something like myths, though. They may reflect aspects of the real world, but they may also be merely pragmatic and convenient, like irrational numbers. And they rearrange, as we see in Kuhn’s ideas of scientific revolutions, which came out around a decade after Quine and the final unraveling of Logical Positivism in the face of everything from the deconstruction of Carnap to Gödelian incompleteness.

We get a flavor, though, that there is a missing piece in models that build multi-length contexts. There is a non-linearity, or a rapid remodeling, that accompanies human learning. In one-shot events, some convenient model is reapplied but with its context ripped out; then the context is rearranged as well, until the reconciliation extends back to the previous exemplars. The new model stands on its own, perhaps, separate but somehow metaphorically related, and those connections don’t reassert themselves until much later, if at all, in some cases. This seems to be part of the meta-learning methodology at play, but the only approach I have seen in recent memory that has this flavor is found in ideas like contextual lexicon reorganization guided by some variant of algorithmic information theory. Yet even there, the number of reorganization steps (or remodelings) remains high, and how this might be applied to one-shot learning problems like the GPT-3 evaluation sets is unclear.
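To give a rough sense of the algorithmic-information-theory flavor just mentioned, here is a loose sketch: a candidate lexicon reorganization is accepted only if it shortens the total description length of the data. Using zlib compression as a crude stand-in for Kolmogorov complexity, and the particular encodings shown, are my own assumptions, not a method drawn from any of the sources discussed here.

```python
# Loose sketch of an algorithmic-information-theory criterion for lexicon
# reorganization: accept a remodeling only if it shortens the description
# of the data. zlib compression is used as a crude, computable stand-in for
# Kolmogorov complexity; the encodings below are purely illustrative.
import zlib

def description_length(text):
    """Approximate description length in bytes via compression."""
    return len(zlib.compress(text.encode("utf-8")))

def accept_remodeling(current_encoding, proposed_encoding):
    """Keep the proposed reorganization only if it compresses to fewer bytes."""
    return description_length(proposed_encoding) < description_length(current_encoding)

# Example: folding a repeated phrase into a single new lexical entry.
current = "the big red dog ran . the big red dog barked . the big red dog slept ."
proposed = "LEX1 = the big red dog . LEX1 ran . LEX1 barked . LEX1 slept ."
print(accept_remodeling(current, proposed))  # True or False, depending on which encoding compresses smaller
```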
