The buzz about ChatGPT and related efforts has been surprisingly resistant to the standard deflationary pressure of the Gartner hype cycle. Quantum computing’s buzz definitely fizzled, though the field appears to be moving towards the plateau of productivity with recent expansions in the number of practical qubits available from IBM and Origin in China, as well as additional government funding driven by national security interests and fears. But ChatGPT has attracted more sustained attention because people can play with it easily, without needing to understand something like Shor’s algorithm for factoring integers. Instead, you just feed it a prompt and are amazed that it writes so well. And the related image generators are delightful (as above) and may represent a true displacement of creative professionals even at this early stage, with video hallucinators evolving rapidly too.
But are Large Language Models (LLMs) like ChatGPT doing much more than stitching together recorded fragments of text ingested from an internet-scale corpus? Are they inferring patterns in any way that goes beyond being stochastic parrots? And why would scaling up a system result in qualitatively new capabilities, if there are any at all?
Some new work covered in Quanta Magazine offers intriguing suggestions that there is a bit more going on in LLMs, although the subtitle contains the word “understanding,” which I think is premature. At heart is the idea that as networks scale up under ordering rules that are not highly uniform or correlated, they tend to break up into collections of distinct subnetworks (substitute “graphs” for “networks” if you are a specialist). The theory, then, is that ingesting a sufficient magnitude of text into a sufficiently large network, together with the error minimization involved in tuning that network to match outputs to inputs, also segregates groupings that the Quanta author and researchers at Princeton and DeepMind refer to as skills. Skills might include the ability to use metaphor, irony, self-serving bias, etc. That these skills can be combined in functional ways when applied to novel queries demonstrates more than just stochastic parroting of input texts.
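To make the combinatorial flavor of that argument concrete, here is a minimal sketch of the random “skills connected to pieces of text” picture the Quanta article describes. The skill count, edge probability, and pairwise coverage measure are my illustrative assumptions, not parameters from the underlying research:

```python
# A toy version of the random "skills <-> text pieces" bipartite graph
# sketched in the Quanta coverage. All numbers here (50 skills, edge
# probability 0.05, the pairwise coverage measure) are illustrative
# assumptions, not parameters from the actual paper.
import itertools
import random

random.seed(0)

def skill_pair_coverage(n_skills, n_texts, p_edge=0.05):
    """Connect each text to each skill independently with probability
    p_edge, then report what fraction of all *pairs* of skills co-occur
    in at least one text."""
    covered = set()
    for _ in range(n_texts):
        skills_in_text = [s for s in range(n_skills) if random.random() < p_edge]
        covered.update(itertools.combinations(skills_in_text, 2))
    total_pairs = n_skills * (n_skills - 1) // 2
    return len(covered) / total_pairs

# Scaling up the "corpus" rapidly fills in combinations of skills, not just
# single skills, which is the crux of the claim that composing skills on a
# novel query goes beyond parroting any single source text.
for n_texts in (100, 1_000, 10_000):
    print(n_texts, round(skill_pair_coverage(n_skills=50, n_texts=n_texts), 3))
```

Even this crude simulation shows coverage of skill pairs climbing steeply with corpus size; the researchers formalize that combinatorial point far more carefully, but the basic intuition survives the toy treatment.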
The work is a fascinating blend of new ideas about how LLMs might be organized (“might” because they are mostly giant black boxes we are experimentally probing) with old ideas from informational physics, like power laws and fractals. The notion of inferring modularity has also been an ongoing theme in evolutionary computing, both because of a recognition that biological systems are built of modular components and because of a realization that solving complex problems requires complex machines. It’s intriguing to see a kind of inversion of the evolutionary model in the case of LLMs, where the system begins with random connectivity and optimizes towards a modular architecture, versus using random variation and growth to optimize towards something similar.
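For a sense of what “segregating into subnetworks” means in graph terms, here is a minimal sketch using networkx that contrasts a deliberately modular graph with a same-sized random one. The generators and parameters are illustrative stand-ins, not a model of any particular LLM:

```python
# Contrast community structure in a planted-partition ("modular") graph
# with an Erdos-Renyi random graph of the same size. The generators and
# parameters below are illustrative choices, not a model of an LLM.
import networkx as nx
from networkx.algorithms import community

modular = nx.planted_partition_graph(l=5, k=20, p_in=0.4, p_out=0.02, seed=1)
random_net = nx.gnp_random_graph(n=100, p=0.1, seed=1)

for name, g in (("modular", modular), ("random", random_net)):
    parts = community.greedy_modularity_communities(g)
    q = community.modularity(g, parts)
    print(f"{name}: {len(parts)} communities found, modularity Q = {q:.2f}")
```

The planted graph should score a noticeably higher modularity Q than the random one, which is roughly the kind of segregation the skills story imagines emerging as training proceeds.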
Can it go further? After all, human cognition has been theorized to be modular. Neural architectures certainly are. But is it possible to use text alone to infer more than inherently language-specific skills like irony? I’ve previously argued (also, also) that there may be basic barriers to advanced competence (and sentience), because physical understanding is required for metaphor and other core features of our minds, and that understanding is not available to shape the necessary systems when they are induced solely from the textual renderings gobbled up by LLMs. Still, it may be too early to conclude we are heading towards a Gartner trough of disillusionment. There is still plenty of modular hallucination left to explore.