Putting Large Language Models in Context
In recent weeks it’s been hard to avoid the buzz around the rise of Large Language Models (LLMs) and particularly the recent launch of ChatGPT. There have also been some notable failures ranging from the hilarious to the alarming. In fact we had to keep updating this post as new stories came out. So, much as we were tempted to keep going on RPC vs TCP, we’re shifting focus this week to go a bit outside our lane to look at some of the underlying issues with AI systems.
As someone whose primary field is networking, I’m not going to claim deep expertise in Artificial Intelligence (AI). But in my role as regional CTO for VMware over the period from 2017 to 2020, I needed to be more of a generalist. So I sought to understand what was happening in AI and what it might mean for the technology industry as it moved into the mainstream. In this post I’m going to share some of what I learned, which is now proving helpful as I process the daily onslaught of new developments in AI. I will note that there is some resistance to calling the latest round of LLM systems “AI”, but that usage seems well established now.
When I was in the final year of my undergraduate electrical engineering degree, I happened to write my thesis on “Expert Systems for VLSI design”. It was a bit of a random choice from a list of topics proposed by my favorite professor, but expert systems were “hot” in 1984, representing one promising line of research in AI at the time. There was a fairly direct line from that thesis to my applying to the Ph.D. program at Edinburgh University. Edinburgh had one of the best Computer Science departments in the U.K. and one of the few dedicated AI departments in the world. What I didn’t know until I arrived in Edinburgh from Australia was that the two departments had a dim view of each other: the CS folks viewed AI as not serious, and the AI folks thought no better of CS. I sat in on a couple of AI courses but pretty soon I picked my side (CS), just in time to avoid the second of several “AI Winters”.
That was about the extent of my AI knowledge until 2017, when I noticed a sort of anxiety among technical people I met about the rise of AI, including uncertainty about what it meant for the tech industry (and VMware in particular). So I started to read up on the state of the art and gather information from my co-workers who were closer to the field than I was. One especially helpful article I read was "Machine Learning Explained" by noted roboticist Rodney Brooks (of iRobot fame). You should read it–it’s fun, but long, so here is a quick summary: Donald Michie, who created the AI department in Edinburgh in the 1960s, built a mechanical computer called MENACE (the Machine Educable Noughts And Crosses Engine) using matchboxes, which learned how to play tic-tac-toe (or noughts and crosses as it’s called in the U.K.). It was a clever piece of design by an AI pioneer who couldn’t get access to an electronic computer in those early days. The machine really did “learn”: after every game of tic-tac-toe, it was given either positive or negative reinforcement in the form of the addition or subtraction of colored beads in the relevant matchboxes, in much the same way that today we use training data to adjust the weights in a neural network. Eventually, with enough training, it was able to play decent but not optimal games against human players, even though no-one ever explained the rules of the game to it.
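The matchbox scheme is simple enough to sketch in a few lines of code. This is a loose illustration, not a faithful reproduction of Michie’s design (his bead counts and reinforcement amounts differed by move number; the threes below are arbitrary): one “matchbox” per board state, one “bead” per candidate move, a random draw to pick a move, and reinforcement that adds or removes beads after the game.

```python
import random

class Menace:
    """A minimal sketch of MENACE-style matchbox learning: one 'matchbox'
    (dict entry) per board state, filled with 'beads' (candidate moves)."""

    def __init__(self):
        self.boxes = {}    # board state -> list of beads (moves, with repeats)
        self.history = []  # (state, move) pairs played in the current game

    def move(self, state, legal_moves):
        # First time we see a state, seed its box with a few beads per move.
        if state not in self.boxes:
            self.boxes[state] = [m for m in legal_moves for _ in range(3)]
        box = self.boxes[state]
        if not box:  # box emptied by repeated punishment: reseed it
            box.extend(legal_moves)
        bead = random.choice(box)  # draw a bead at random
        self.history.append((state, bead))
        return bead

    def reinforce(self, won):
        # Win: add beads of the played color, making those moves likelier.
        # Loss: remove one bead, making them less likely.
        for state, move in self.history:
            if won:
                self.boxes[state].extend([move] * 3)
            elif move in self.boxes[state]:
                self.boxes[state].remove(move)
        self.history = []
```

The point of the sketch is how little machinery is involved: there is no representation of the rules, only move frequencies adjusted by reward–exactly the property Brooks draws attention to.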
Brooks makes some good points about the similarity of this mechanical system to modern reinforcement learning systems. What strikes me is that no-one would be tempted to use the term “intelligent” or “sentient” to describe this machine. It learns to get better at a task through training, and there is something about “learning” that we associate with intelligence, but since this is just a few hundred matchboxes filled with colored beads, it’s pretty easy to conclude that there is no intelligence. There is not even a concept of “three crosses in a row”–the machine just plays moves that have been rewarded because they led to wins in the past.
Not long before Brooks wrote his article, AlphaGo (a game-playing machine using deep learning) had managed to beat the world’s best players at the game of Go, marking something of a milestone in the history of human-vs-computer competition. But Brooks suspected that the Go-playing machine didn’t understand the game in the way a human does, just as MENACE didn’t understand what a row of crosses was. He asked the DeepMind team if their machine would cope with a subtle change to the rules of Go (changing the board size) and they were quick to agree that it would not, because it had not encountered that situation in its training. That observation was effectively borne out by the recent defeat of a system similar to AlphaGo. Quoting the article reporting that defeat:
The tactics used … involved slowly stringing together a large “loop” of stones to encircle one of his opponent’s own groups, while distracting the AI with moves in other corners of the board. The Go-playing bot did not notice its vulnerability, even when the encirclement was nearly complete…
“As a human it would be quite easy to spot,” he added.
In other words, the AI system, having been trained against typical Go strategies, failed to spot a novel strategy that would have been obvious to a human, because it had no concept of what a “loop” of stones looked like.
This lack of understanding is one of the key concerns about modern machine learning systems. And it has been on display in the various failures of ChatGPT and similar LLM systems. Douglas Hofstadter had some fun getting GPT-3 to give nonsensical answers when faced with questions that most humans would have quickly dismissed as unanswerable. Just as the game-playing systems lack an understanding of the game that would be obvious to a human, the chat systems lack an understanding of the meaning behind the words they are producing.
There are some other issues that are common to Michie’s MENACE and modern machine learning systems. It is well known now that performance of these systems depends heavily on the training data that is fed in. Michie showed how MENACE learned different styles of play depending on the sort of human opponent it faced, and the Go-playing system was stumped by a style of play it had not been trained on.
It’s also worth noting how much work it took to map a simple game like noughts and crosses onto a machine learning algorithm, and Brooks makes the point that ML is not just some sort of magic that we can sprinkle on hard problems and get results out the other side. “Every successful application of ML is hard won by researchers or engineers carefully analyzing the problem that is at hand.” This makes me appreciate the hard work that has gone into making ML systems work at all, and leaves me a bit less willing to believe that we are going to see all challenging human tasks taken over by machines soon.
The idea that LLMs have no understanding of the words they produce is conveyed by the term “stochastic parrots”, coined by Emily Bender et al. in their influential paper “On the Dangers of Stochastic Parrots”. (This is also the paper that led to Timnit Gebru being forced out of Google.) The lack of understanding in large language models is easy to lose sight of because their conversational skills are so impressive (rather more so than the tic-tac-toe skills of MENACE), but I’m persuaded by the arguments made by Bender and team (and many others), especially as they come on top of the copious examples of wrong or bizarre answers coming out of LLMs. Bender has gone on to make the case that LLMs are a really bad choice for search engines–which is interesting as Microsoft and Google seem to be racing headlong in that direction. Maybe some of the recent speed bumps, such as the wrong answer about the James Webb telescope in Google’s Bard launch announcement, will give the search giants pause. My approach, at least for now, is to treat these LLM-based systems as very large, efficient collections of matchboxes–and keep working in my chosen field of networking.
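The “collections of matchboxes” framing can be made concrete with a toy example. A bigram model is nothing like a transformer LLM under the hood (real LLMs learn distributed neural weights, not raw counts), but it captures the stochastic-parrot mechanism in miniature: one “matchbox” per word, filled with a “bead” for every word that has followed it, and generation is just repeated random draws with no model of meaning anywhere.

```python
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Count word-to-next-word transitions: one 'matchbox' per word,
    holding a 'bead' for each word that has followed it in the corpus."""
    boxes = defaultdict(list)
    words = corpus.split()
    for cur, nxt in zip(words, words[1:]):
        boxes[cur].append(nxt)
    return boxes

def parrot(boxes, start, n=10):
    """Generate text by repeatedly drawing a bead at random: locally
    plausible output with no understanding behind it."""
    out = [start]
    for _ in range(n):
        beads = boxes.get(out[-1])
        if not beads:
            break
        out.append(random.choice(beads))
    return " ".join(out)
```

Scaled up by many orders of magnitude and with counts replaced by learned weights, this is the caricature behind the “stochastic parrot” label: fluent surface form, sampled from the statistics of the training data.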
At some point I had to stop updating this post, but here are two interesting examples of LLM failures that came up after I’d finished my draft. Geocoding company OpenCage found that customers were signing up for a service it doesn’t offer, thanks to ChatGPT writing a well-formed (but incorrect) piece of code that called their API to convert a phone number into a location. And The Guardian reported on “cursed” crochet designs generated by ChatGPT–a crochet pattern being, in essence, a form of code. As noted elsewhere, LLMs don’t seem to do well with mathematics, which is central to crochet (obviously). Finally, the New York magazine profile of Emily Bender is a great longer read on LLMs and their limitations.
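I haven’t seen the exact code ChatGPT produced in the OpenCage case, but the failure mode is easy to sketch. OpenCage’s real API geocodes place names and addresses; the hallucinated code simply fed it a phone number, which is syntactically valid and semantically useless. The function below (hypothetical, illustrative only–it builds the request URL rather than making a call) shows why such code looks so convincing:

```python
from urllib.parse import urlencode

# OpenCage's real geocoding endpoint, which accepts a free-text query.
OPENCAGE_URL = "https://api.opencagedata.com/geocode/v1/json"

def phone_to_location_url(phone, api_key):
    """A well-formed request that cannot work: OpenCage geocodes place
    names and addresses, not phone numbers, so the 'q' below is asking
    the service to do something it has never offered."""
    return OPENCAGE_URL + "?" + urlencode({"q": phone, "key": api_key})
```

The code compiles, the URL is valid, the endpoint exists–everything a plausibility-driven text generator optimizes for–and the service it implements does not.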
Welcome to all our new subscribers. If you’d like to try claiming the subscription fee as a business expense, here is a template to help. H/t to Gergely Orosz at The Pragmatic Engineer for the inspiration. And don’t forget to follow us on Mastodon.