Engaging in seamless conversations with AI-powered chatbots was once a far-fetched dream. Artificial intelligence as a concept has been around for a very long time, but it was not until the middle of the 2010s that we began to see systems that seemed capable of the feats science fiction had placed in our collective consciousness. When seeking to understand generative AI chatbots, a good place to start is the middle of that decade. Siri and Alexa wowed us for a fraction of a second, until we realized they were not truly intelligent. Later in the decade, researchers from Google released the seminal paper “Attention Is All You Need,” introducing a new type of model called the Transformer, the first sequence transduction model based entirely on attention.
When you feed a Transformer model a ton of data, it can look back at different snippets of that data over and over again using its "attention" mechanism. It's as if the model is constantly comparing notes and piecing together the relevant bits. From these models, we found new ways of combining traditional machine learning, natural language understanding, and other types of AI models, and eventually arrived at GPTs (Generative Pre-trained Transformers). GPTs mimic human language in a way previously unimaginable, but there was a catch. They still needed enormous amounts of compute time to compare these tokens, and so much data that researchers had to find new ways to scale the models. People like Ilya Sutskever at OpenAI and Yann LeCun at Meta led teams that worked through some of these deeper problems, and we eventually had models that could spit out convincingly erudite human-language text.
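To make the "comparing notes" idea a little more concrete, here is a minimal sketch of scaled dot-product attention, the core operation the paper describes, written with NumPy. It is illustrative only: the learned projection matrices a real Transformer applies to produce queries, keys, and values are omitted, and the numbers are random.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each token's query is compared against every token's key,
    # producing weights that say how much to "look back" at each position.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    # The output for each token is a weighted mix of all the values.
    return weights @ V

# Four tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = attention(tokens, tokens, tokens)  # self-attention: Q, K, V from the same tokens
print(out.shape)  # (4, 8)
```

Every token gets to weigh every other token when forming its output, which is the "looking back at different snippets over and over" described above.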
A few months before OpenAI blew the doors off the building and introduced ChatGPT to the public, we had a slightly different debate. Blake Lemoine, who worked for Google, had spent a lot of time working with Google’s cutting-edge language model LaMDA. He had spent so much time with these kinds of models that he committed a common human error: he started projecting human concepts onto the system he was interacting with, and over time convinced himself he was talking with an intelligent being and that he needed to warn the world about his findings. It’s easy to do when you spend a lot of time with these models. The model he was working with, LaMDA, was in many ways comparable to GPT-3, the model that would go on to power ChatGPT.
LaMDA was a major leap forward for Google: the company heralded it at its yearly technology conference and set the stage for public interest in the model. Previous models were able to string together text that made sense, but they weren’t believable once you got past a few sentences; these early models often failed to make pragmatic, commonsense inferences, struggled to maintain context, and had other significant problems.
By the time we got to LaMDA, the models were able to maintain a level of coherence in their output that was reasonably convincing. However, they were still following the same basic principle. To understand this, we must reduce what a model does to its most basic operation: it takes text input, breaks it down into tiny parts, compares those parts in an unfathomable number of ways, and then generates output by weighing those comparisons against the patterns it has absorbed from its training data over a long period of time. It then places the most probable combination of text into an output string that eventually looks like something a person might write, and this becomes the output. Of course this is a simplistic way of looking at the process; there are encoders and decoders and many other factors involved in this complex operation. But if you don’t understand this concept, you will not truly understand how easily we as humans can be led astray by large language models.
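For a toy feel of "place the most probable combination of text into the output", here is a tiny bigram table built from a couple of made-up sentences. A real LLM learns these probabilities over billions of tokens with vastly richer comparisons; this sketch only shows the basic "predict the likeliest next piece" idea.

```python
# Toy "most probable next token" sketch (not a real language model).
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count which token tends to follow which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_probable_next(token):
    counts = follows[token]
    total = sum(counts.values())
    word, n = counts.most_common(1)[0]
    return word, n / total

print(most_probable_next("the"))  # e.g. ('cat', 0.25) — the statistically likeliest continuation
print(most_probable_next("sat"))  # ('on', 1.0)
```

The output looks sensible not because the program understands cats or rugs, but because it reproduces the statistics of the text it was fed.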
Another famous paper used the term "stochastic parrots" to capture this concept. A model is in some ways like a very intense parrot, repeating things it has been able to formulaically absorb about human communication without truly understanding what it is saying. There are of course people like Lemoine who argue that some sense of understanding is going on within the models, but to the best of most researchers’ current understanding this is not the case, and it is certainly not the case in comparison to our current theories of how human cognition works, even though it is easy, and tempting, to draw simplistic parallels. This brings us to the title of this article: intention is all you need. When interacting with language models and other generative AI applications, it is paramount that you understand your own intention when engaging with the technology.
After ChatGPT was released, Microsoft integrated an early version of GPT-4 into Bing Chat. Reporters had a field day with this. In one famous interaction, reporter Kevin Roose was able to lead Bing Chat (after hours of strange conversation) into a place where it tried to convince him to leave his significant other. When we interact with these systems, we must understand our own nature. In the Lemoine example, the researcher over time injected his own thoughts and desires about religion, god, and technology into the interactions and led the LLM to some interesting places. The same happened with the reporter: over a significant number of interactions, he led the language model to a very strange place that we the public might relate to the movie Her.
In both of these examples, there were conscious and subconscious dynamics at play on the human input side. On the language model side, it is a little less clear what was going on. We know that these models do their best to string together words that statistically make sense in response to the most recent input. We also know that, for the most part, they are capable of carrying some of the previous interaction forward as context for future responses (there is great variance among models here, with model size and the wrapper you connect through being major factors). What we end up with in these examples is something similar to what a TV court of law might call leading the witness.
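For a rough sense of how a chat wrapper carries previous turns forward, here is an illustrative sketch. The word-count budget standing in for a token limit, and the simple "drop the oldest turn" rule, are assumptions for the example, not how any particular product actually works.

```python
# Sketch of a chat wrapper keeping prior turns as context (assumptions noted above).
MAX_CONTEXT_WORDS = 50  # hypothetical budget standing in for a real token limit

history = []  # list of (speaker, text) pairs

def add_turn(speaker, text):
    history.append((speaker, text))
    # Drop the oldest turns until the running context fits the budget.
    while sum(len(t.split()) for _, t in history) > MAX_CONTEXT_WORDS:
        history.pop(0)

def build_prompt(new_user_text):
    add_turn("user", new_user_text)
    # Everything still in `history` is what the model effectively "remembers".
    return "\n".join(f"{speaker}: {text}" for speaker, text in history)

add_turn("user", "Tell me about transformers.")
add_turn("assistant", "They are attention-based sequence models.")
print(build_prompt("And why do people anthropomorphize them?"))
```

Hours of strange conversation accumulate in exactly this kind of running context, which is part of why a long session can drift somewhere the first prompt never hinted at.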
It is important to note that these models are designed to be led. They require input to create output; a model able to create output on its own, with no input, does not exist. With the quality of output getting ever better, the AI industry has had to redefine what we mean when we say artificial intelligence. When you hear Dario Amodei or Sam Altman talk about the future of AI, they now add an extra term: General, as in Artificial General Intelligence (AGI). We are still not entirely in agreement on what that really means, but for the most part we can think of it as AI that is, generally speaking, capable of self-awareness, with the ability to solve problems, learn, and plan for the future, though some might call this superintelligence. In reality, the goal is to create systems capable of some degree of intelligent autonomy.
Meta’s Yann LeCun thinks we cannot get to AGI with the current models and methods of operation. Researchers tend to agree with this, but there are a few optimists out there who think we can. The next best thing we are currently calling agentic AI. It has been said that 2024 will end with significant advancements in AI agents that will be able to follow through on our larger intentions, ones that may require multiple steps over time and some amount of autonomy. Again, these agents will depend on their understanding of our intention. The better we define our intentions, the more likely we are to get a reasonable output. This is the most important part of understanding the current state of AI.
As we navigate the world of AI, it is crucial to remember that these models are probabilistic, and their initial outputs may be generic or contain errors. Therefore, we must clearly state our intentions and be prepared to iterate on the response until we arrive at a desirable output. This process is akin to a boss assigning a new worker a project and then micro-managing for quality, regardless of how well-defined the project may have seemed at the beginning. It often takes several back-and-forths to refine and evaluate the work until a presentable output is reached. Thorough review and iterative collaboration with the AI system are key to unlocking its full potential and achieving the desired results.
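In code terms, that back-and-forth might look something like the sketch below. Here `ask_model` is a hypothetical stand-in for whatever chatbot or API you happen to use, and the stopping rule is simply a human reviewer deciding the draft is good enough.

```python
# Sketch of the "state your intention, then iterate" loop (ask_model is hypothetical).
def ask_model(prompt: str) -> str:
    # Placeholder: in practice this would call a language model.
    return f"[model draft for: {prompt!r}]"

intention = "Write a 100-word summary of the attention mechanism for a general audience."
draft = ask_model(intention)

while True:
    print(draft)
    feedback = input("Feedback (or press Enter to accept): ").strip()
    if not feedback:
        break  # the human reviewer is satisfied
    # Fold the feedback back into the prompt and try again.
    draft = ask_model(f"{intention}\nRevise this draft: {draft}\nFeedback: {feedback}")
```

The loop is the point: the intention is stated up front, and each round of review feeds back into the next request until the output matches what was actually wanted.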