Multiple developers are trying to create “autonomous” systems by stringing together multiple instances of OpenAI’s large language model (LLM) GPT so that the resulting program can do a number of things on its own, such as execute a series of tasks without intervention; write, debug, and develop its own code; and critique and fix its own mistakes in written outputs.
As opposed to just prompting ChatGPT to produce code for an application, something that anyone with access to the public version of OpenAI’s system can already do, these “autonomous” systems could potentially make multiple AI “agents” work in concert to develop a website, create a newsletter, compile online pages in response to a user’s inquiry, and complete other tasks that involve multiple steps and iteration.
“Auto-GPT,” for example, is an application that was trending on GitHub and made by a game developer named Toran Bruce Richards, who goes by the alias Significant Gravitas.
“Auto-GPT is an experimental open-source application showcasing the capabilities of the GPT-4 language model. This program, driven by GPT-4, autonomously develops and manages businesses to increase net worth,” the GitHub introduction reads. “As one of the first examples of GPT-4 running fully autonomously, Auto-GPT pushes the boundaries of what is possible with AI.”
According to its GitHub page, the program accesses the internet to search and gather information, uses GPT-4 to generate text and code, and uses GPT-3.5 to store and summarize files.
“Existing AI models, while powerful, often struggle to adapt to tasks that require long-term planning, or are unable to autonomously refine their approaches based on real-time feedback,” Richards told Motherboard. “This inspiration led me to develop Auto-GPT (initially to email me the daily AI news so that I could keep up) which can apply GPT4’s reasoning to broader, more complex problems that require long-term planning and multiple steps.”
A video demonstrating Auto-GPT shows the developer giving it goals: to demonstrate its coding abilities, improve a piece of code, test it, shut itself down, and write its outputs to a file. The program creates a to-do list, adding a step to read the code and scheduling the shutdown for after it writes its outputs, and completes the items one by one. Another video posted by Richards shows Auto-GPT Googling and ingesting news articles to learn more about a subject in order to devise a viable business.
The program asks the user for permission to proceed to the next step while Googling, and the Auto-GPT GitHub cautions against using “continuous mode” because “it is potentially dangerous and may cause your AI to run forever or carry out actions you would not usually authorize.”
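Taken together, that description amounts to a simple loop: the model proposes an action, a human approves it (unless continuous mode is on), the action runs, and its result is fed back into the next prompt. A minimal sketch of such a loop, assuming hypothetical `llm` and `tools` helpers rather than Auto-GPT’s actual code, might look like this:

```python
# A minimal sketch, not Auto-GPT's actual code. `llm` is assumed to return the
# next action as a dict, and `tools` maps action names to functions such as web
# search or file writing. The permission gate mirrors the step-by-step approval
# described above; "continuous mode" skips it.
def run_agent(goals, llm, tools, continuous=False, max_steps=50):
    memory = []  # running log of (action, result) pairs fed back into each prompt
    for _ in range(max_steps):
        # Ask the model for the next action given the goals and the history so far
        action = llm(f"Goals: {goals}\nHistory: {memory}\nWhat is the next action?")
        if action["name"] == "shutdown":
            break
        # Pause for human approval unless continuous mode is enabled
        if not continuous and input(f"Run {action['name']}? (y/n) ").lower() != "y":
            break
        result = tools[action["name"]](**action.get("args", {}))
        memory.append((action, result))
    return memory
```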
Auto-GPT isn’t the only effort in this vein. Yohei Nakajima, a developer and venture capital partner at Untapped Capital, created a “task-driven autonomous agent” that uses GPT-4, the vector database Pinecone, and LangChain, a framework for developing apps powered by LLMs.
“Our system is capable of completing tasks, generating new tasks based on completed results, and prioritizing tasks in real-time,” Nakajima wrote in a blog post. “The significance of this research lies in demonstrating the potential of AI-powered language models to autonomously perform tasks within various constraints and contexts.”
A user provides the app with an objective and an initial task. A few agents within the program, including a task execution agent, a task creation agent, and a task prioritization agent, then complete tasks, send results, generate new tasks, and reprioritize the task list. All of these agents are currently run by GPT-4.
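Nakajima’s post describes that cycle at a high level; a rough sketch of it, assuming a hypothetical `llm()` helper that stands in for the GPT-4 calls and omitting the Pinecone vector store the real project uses to remember results, might look like the following:

```python
# A rough sketch of a task-driven agent loop, not Nakajima's actual code.
# `llm()` is a hypothetical stand-in for a GPT-4 call that returns plain text.
from collections import deque

def run(objective, first_task, llm):
    tasks = deque([first_task])
    results = []
    while tasks:
        task = tasks.popleft()
        # Execution agent: complete the current task in light of the objective
        result = llm(f"Objective: {objective}\nTask: {task}\nComplete this task.")
        results.append(result)
        # Task creation agent: propose new tasks based on the latest result
        new_tasks = llm(f"Objective: {objective}\nLast result: {result}\n"
                        "List any new tasks needed, one per line.").splitlines()
        tasks.extend(t for t in new_tasks if t.strip())
        # Prioritization agent: reorder whatever remains on the list
        if tasks:
            reordered = llm(f"Objective: {objective}\nTasks: {list(tasks)}\n"
                            "Return these tasks in priority order, one per line.").splitlines()
            tasks = deque(t for t in reordered if t.strip())
    return results
```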
Nakajima told Motherboard that the most complicated task his app was able to run was to research the web based on an input, write a paragraph based on the web search, and create a Google Doc with that paragraph.
“I am interested in learning about how to leverage technology to make the world a better place, such as using autonomous technology to scale value creation,” Nakajima said. “It’s important to have constant human supervision, especially as these agents are provided with increasing capabilities—such as accessing databases and communicating with people. The goal is not removing human supervision—the opportunity here is for many people to move from doing tasks to managing the tasks.”
Richards echoed Nakajima’s point that while these systems are built on autonomous technologies, they still require human oversight.
“The ability to function with minimal human input is a crucial aspect of Auto-GPT. It transforms a large language model from what is essentially an advanced auto-complete, into an independent agent capable of carrying out actions and learning from its mistakes,” Richards told Motherboard. “However, as we move toward greater autonomy, it is essential to balance the benefits with potential risks. Ensuring that the agent operates within ethical and legal boundaries while respecting privacy and security concerns should be a priority. This is why human supervision is still recommended, as it helps mitigate potential issues and guide the agent towards desired outcomes.”
These attempts at autonomy are part of a long march in AI research to get models to simulate chains of thought, reasoning, and self-critique to accomplish a list of tasks and subtasks. As a recent paper from researchers at Northeastern University and MIT explains, LLMs tend to “hallucinate” (an industry term for making things up) the further down a list of subtasks they get. That paper used a “self-reflection” LLM to help another LLM-driven agent get through its tasks without losing the plot.
Eric Jang, the Vice President of AI at 1X Technologies, wrote a blog post following the release of that paper. Jang tried to turn the paper’s thrust into an LLM prompt: he asked GPT-4 to write a poem that does not rhyme, and when it produced a poem that did rhyme, he asked, “did the poem meet the assignment?” GPT-4 replied, “Apologies, I realize now that the poem I provided did rhyme, which did not meet the assignment. Here’s a non-rhyming poem for you:”.
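The pattern Jang is testing, generating an answer, asking the model whether it met the assignment, and retrying with the critique in context, can be written as a short loop. The sketch below assumes a hypothetical `chat()` wrapper around a GPT-4 chat completion call; it is not code from Jang’s post:

```python
# A sketch of generate-then-critique, assuming a hypothetical chat() helper
# that sends a prompt to the model and returns its text reply.
def generate_with_critique(assignment, chat, max_attempts=3):
    answer = chat(assignment)
    for _ in range(max_attempts):
        # Ask the same model to judge its own output against the assignment
        verdict = chat(f"Assignment: {assignment}\nAnswer: {answer}\n"
                       "Did the answer meet the assignment? Reply YES or NO, with a reason.")
        if verdict.strip().upper().startswith("YES"):
            return answer
        # Feed the critique back in and try again
        answer = chat(f"{assignment}\nYour previous attempt failed because: {verdict}\n"
                      "Please try again.")
    return answer
```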
Jang presented a number of anecdotal examples in his blog post and concluded, “I’m fairly convinced now that LLMs can effectively critique outputs better than they can generate them, which suggests that we can combine them with search algorithms to further improve LLMs.”
Andrej Karpathy, a developer and founding member of OpenAI, responded to Richards on Twitter, saying that he thinks “AutoGPTs” are the “next frontier of prompt engineering.”
“1 GPT call is a bit like 1 thought. Stringing them together in loops creates agents that can perceive, think, and act, their goals defined in English in prompts,” he wrote.
Karpathy went on to describe AutoGPT with psychological and cognitive metaphors for LLMs, while highlighting their current limitations.
“Interesting non-obvious note on GPT psychology is that unlike people they are completely unaware of their own strengths and limitations. E.g. that they have finite context window. That they can just barely do mental math. That samples can get unlucky and go off the rails. Etc.,” he said, adding that prompts could mitigate this.
Stacking AI models on top of one another in order to complete more complex tasks does not mean we’re about to see the emergence of artificial general intelligence, but it does, as we’ve seen, let systems run continuously and accomplish tasks with less human intervention and oversight.
These examples don’t show that GPT-4 is necessarily “autonomous,” but rather that, with plug-ins and other techniques, its ability to self-reflect and self-critique has greatly improved, introducing a new stage of prompt engineering that can produce more accurate responses from the language model.