It’s an open secret that AI isn’t very artificial or intelligent on its own. Machine-learning systems typically depend on low-paid crowdworkers for training and fine-tuning. Now, according to researchers, AI is poised to take their jobs.
In a new paper, political science researchers from the University of Zurich found that ChatGPT could outperform crowdworkers who perform text annotation tasks—that is, labeling text to be used in training an AI system. They found that ChatGPT could label text with more accuracy and consistency than the human annotators they recruited on Mechanical Turk, an Amazon-owned crowdsourcing platform, as well as trained annotators such as research assistants.
The researchers gave ChatGPT a sample of 2,382 tweets and asked it to classify each one by relevance, topic, stance, problem-or-solution framing, and policy framing. They concluded that, by using ChatGPT, they obtained higher accuracy and higher intercoder agreement, meaning the percentage of tweets that were assigned the same label by two different ChatGPT runs. Perhaps most importantly, they found that they could save money by using ChatGPT: the AI was twenty times cheaper than paying humans on Mechanical Turk, who already make as little as 5 cents per annotation.
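In practice, this kind of annotation is scripted against ChatGPT's API. The sketch below is purely illustrative: the label set, prompt wording, and model name are assumptions rather than the authors' actual materials, and it assumes the pre-1.0 openai Python client.

```python
# Illustrative sketch of zero-shot tweet labeling and a simple agreement check.
# The labels, prompt wording, and model are hypothetical, not the study's setup.
import openai  # assumes the pre-1.0 openai Python client

openai.api_key = "YOUR_API_KEY"  # placeholder

LABELS = ["relevant", "irrelevant"]  # hypothetical label set

def label_tweet(tweet: str) -> str:
    """Ask the model for a single label for one tweet."""
    prompt = (
        f"Classify the following tweet as one of {LABELS}. "
        f"Reply with the label only.\n\nTweet: {tweet}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response["choices"][0]["message"]["content"].strip().lower()

# Intercoder agreement here is simply the share of tweets that two
# independent ChatGPT runs label identically.
tweets = ["example tweet one", "example tweet two"]
run_a = [label_tweet(t) for t in tweets]
run_b = [label_tweet(t) for t in tweets]
agreement = sum(a == b for a, b in zip(run_a, run_b)) / len(tweets)
print(f"Agreement between runs: {agreement:.0%}")
```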
This study adds to the larger conversation around how jobs will be impacted by rapidly advancing AI language models, such as OpenAI’s GPT series. Researchers from OpenAI argued in a recent paper that 80 percent of the U.S. workforce could have at least 10 percent of their tasks affected by the introduction of GPTs. However, automating human annotators is especially grim because this was already a precarious population of workers—an outsourced labor force that performed rote tasks for Big Tech companies for pennies.
While large companies like Google and Microsoft have been boasting about their technological progress and speed in the AI sector, the reality is that all of their AI models rely on tedious, low-paid human labor.
Tech companies use tens of thousands of workers to manually label and filter content from AI models’ datasets. This is because AI is often not yet able to recognize the nuances of an image, especially when it is still being trained. Time Magazine reported in January that OpenAI, the creator of ChatGPT, paid Kenyan workers less than $2 an hour to make its chatbot safer to use. The workers were regularly faced with traumatic, NSFW content, including graphic text about child sexual abuse, bestiality, murder, suicide, and torture.
Even once AI models are deployed, they rely on interaction with human users to fine-tune them and identify their shortcomings.
Replacing the crowdworkers who train AI with AI itself not only fails to address the terrible conditions those workers have already been exposed to, but also takes away work that provides a livelihood for many people. Krystal Kauffman has been a Turker (as MTurk workers refer to themselves) for the last seven years and currently helps lead Turkopticon, a non-profit that advocates for Turkers’ rights. She told Motherboard that the organization’s Turkers don’t trust that ChatGPT’s abilities can replace theirs.
“The first thing we noted is that ChatGPT is constantly learning and changing. If the data was run on GPT-4, would the same results come up? Would the results be different in one year after countless additions to the data set? What sources train the models? In addition, we noted that the study was run earlier this month which demonstrates a lack of peer review,” Kauffman told Motherboard, on behalf of Turkopticon. “ChatGPT generates text but a human still has to read it to decide if it is GOOD text. (i.e. doesn’t contain violence or racism.)”
“Writing is about judgment, not just generating words,” she added. “Currently and for the foreseeable future, people like Turkers will be needed to perform the judgment work. There are too many unanswered questions at this point for us to feel confident in the abilities of ChatGPT over human annotators.”
The researchers agreed that it is too early to tell the extent to which ChatGPT can actually replace a human workforce. “Our paper demonstrates ChatGPT’s potential for data annotation tasks, but more research is needed to fully understand ChatGPT’s capacities in this area. For example, our tests used tweets in English and were performed on a relatively limited number of tasks. It will be important to extend the analysis to more kinds of tasks, types of data, and languages,” paper co-author Fabrizio Gilardi told Motherboard in an email.
The GPT models have already been shown to be unreliable in many cases. Microsoft researchers released a paper in which they listed a number of limitations of the GPT-4 model, including that it makes up facts it hasn’t been trained on, has trouble knowing whether it’s guessing or confident, can’t verify whether content is consistent with its training data, and is very sensitive to the framing and wording of prompts. (Despite this, they claimed it showed “sparks” of general intelligence.)
“We see two reasons why ChatGPT performed well in our tests, despite its tendency to hallucinate. First, our tasks do not involve the generation of detailed text, but of short, very specific answers. These answers may well be wrong, but are not made up in the same way hallucinations are,” Gilardi said. “Second, we are not interested in one single output for each prompt; instead, we aggregate a large number of answers. ChatGPT does make mistakes, but on average it tends to provide accurate responses in our test.”
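The aggregation Gilardi describes can be as simple as a majority vote over repeated answers for the same tweet; the short sketch below uses made-up labels purely for illustration.

```python
# Majority-vote aggregation over repeated ChatGPT answers for one tweet.
# The label values are made up for illustration.
from collections import Counter

def aggregate(answers: list[str]) -> str:
    """Return the most common label among repeated runs on the same item."""
    label, _count = Counter(answers).most_common(1)[0]
    return label

repeated_answers = ["relevant", "relevant", "irrelevant", "relevant"]
print(aggregate(repeated_answers))  # -> "relevant"
```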
The paper states that the accuracy of ChatGPT’s annotations was measured against the trained human annotators, highlighting that these tasks still rely on human benchmarks and oversight. The solution, then, shouldn’t be to relegate all tasks to ChatGPT, but to see how it can be used as part of an overall process that still involves human supervision and interaction.
“We need to think seriously about the human labor in the loop driving AI. This workforce deserves training, support and compensation for being at-the-ready and willing to do an important job that many might find tedious or too demanding,” Mary L. Gray and Siddharth Suri, authors of the book Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass, wrote in a 2017 article for the Harvard Business Review. Gray and Suri recommended a number of steps, including providing workers with education opportunities that allow them to contribute to AI models beyond labeling, and improving their conditions and wages.