Tech

ChatGPT Is So Bad at Essays That Professors Can Spot It Instantly


Since ChatGPT launched, it’s been cast as a harbinger of doom for writers, programmers and search engines. But no single concept’s demise has been more relentlessly trumpeted than that of the humble student essay. The chatbot’s high-speed discharges on Jane Austen and grammatically flawless explanations of the Krebs cycle have sent educators into conniptions over the future of text-based assessments.

Professors themselves have begun tolling the bell for the written assignment, universities are updating tests to prevent students from using the chatbot, and, least surprisingly of all, Elon Musk has declared homework dead. The fanfare appears to assume that in the grubby paws of cheaters, ChatGPT’s shrewd disquisitions threaten a tsunami of unearned As.


But university professors are catching ChatGPT assignments in the wild for a different reason: because the AI-produced essays are garbage.

“The first indicator that I was dealing with AI was that, despite the syntactic coherence of the essay, it made no sense,” wrote Darren Hick, assistant professor of philosophy at Furman University, in a Facebook post after confronting his first ChatGPT-generated essay, on ‘Hume and the paradox of horror’.

For another professor, who asked to remain unnamed, it was also the fact that the essay was jarringly, clangingly wrong that first raised their suspicion that ChatGPT might have been involved. The essay, which addressed the work of critical theorist Judith Butler, “was just nonsense,” they said. It appeared to have mashed together various sources that “talked about Butler and sexuality and gender and whatever… It was a series of sentences that made their own kind of sense individually, but together made very little sense.”

Students are often wrong too, but it’s the idiosyncratic ways in which ChatGPT botches assignments that professors are starting to recognize. They say the essays embody a constellation of traits that triggers an uncanny valley effect in the reader. Educators are starting to share these early encounters, along with clues for spotting the ghostly imprints of ChatGPT.

“Normally, when a student plagiarizes, it’s a cost-benefit analysis that comes up because they’re desperate,” Hick told Motherboard. They don’t know the material and they don’t have time to complete the assessment, so they ask how likely it is that they’ll get caught. “Because it’s a last minute scramble, in most cases the essays are terrible,” he said.

Not so with ChatGPT. In the case of Hick’s ChatGPT essay, “it was wrong, but it was confident and it was clearly written,” he said. “If I didn’t know the material better, it would have looked good. And that, that’s a weird combination of flags which I’d never seen before.”

There are also the tell-tale stylistic cues. “It tends to produce essays that are filled with bland, common-wisdom platitudes,” said Bret Devereaux, a visiting history lecturer at the University of North Carolina at Chapel Hill, who recently encountered his own first ChatGPT-created assignment. “It’s sort of the difference between ordering a good meal at a restaurant, and ordering the entire menu in a restaurant, sticking it in a blender, and then having it as soup,” he said. “The latter is just not going to taste very good.”

Then there’s the fairly important point that ChatGPT is a barefaced liar. If its habit of fabrication isn’t apparent in the essay itself, it’s likely to rear its head in the citations. The chatbot has a penchant for conjuring up entirely imagined works by sometimes fictitious authors. It’s also been known to blend the names of less famous scholars with more prolific ones. ChatGPT’s catch-22 is that you can only reliably spot its lies if you’re a subject matter expert yourself, meaning a panicked student turning to the software an hour before the deadline is likely to struggle to determine what’s inaccurate.

ChatGPT rarely reproduces content verbatim from its training data, meaning it tends to fly below the radar of traditional plagiarism detection software. But the anonymous professor said some of the phrases in their ChatGPT-generated essay were easily traced back to what was probably part of its online source material. (Turnitin has integrated an AI detector, GPTZero is a similar independent tool, and OpenAI has released its own classifier, but none of them are foolproof right now.)
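Traditional detectors work, at their core, by hunting for long verbatim overlaps with known sources, which is precisely what generated text lacks. Here is a minimal sketch of that idea, with invented example sentences (real tools like Turnitin use far more sophisticated fingerprinting than this):

```python
def shared_ngrams(text_a, text_b, n=5):
    """Fraction of text_a's n-word sequences found verbatim in text_b."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    a, b = ngrams(text_a), ngrams(text_b)
    return len(a & b) / len(a) if a else 0.0

# Invented sentences for illustration only.
source = "the paradox of horror asks why audiences seek out art that frightens them"
copied = "the paradox of horror asks why audiences seek out art that frightens them"
reworded = "why audiences pursue art that scares them is the puzzle horror poses"

print(shared_ngrams(copied, source))    # 1.0: a verbatim copy is easily flagged
print(shared_ngrams(reworded, source))  # 0.0: same idea, no shared word runs
```

A wholesale copy lights up immediately; text that expresses the same idea in fresh words, which is what ChatGPT produces by construction, shares no long word sequences and scores zero.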

All of these quirks arise from how ChatGPT functions. The OpenAI tool has ingested vast language datasets and learned the probabilistic relationships between words, followed by further reinforcement learning from human feedback about what constitutes a “good” response to a given prompt. It can produce sentences that sound correct, despite the program having no understanding of the underlying concepts.
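ChatGPT’s underlying model is vastly more sophisticated, but the basic mechanic, predicting a plausible next word from statistics over a training corpus, can be sketched in a few lines. The toy corpus and bigram model below are invented for illustration and stand in for billions of learned parameters, not for OpenAI’s actual system:

```python
import random
from collections import Counter, defaultdict

# A toy corpus standing in for the model's training data.
corpus = (
    "the essay argues that the theme is central . "
    "the essay argues that the evidence is weak . "
    "the theme is central to the argument ."
).split()

# Count how often each word follows each other word (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(prev):
    """Sample a next word in proportion to how often it followed `prev`."""
    options = follows[prev]
    words = list(options)
    return random.choices(words, weights=[options[w] for w in words])[0]

# Generate "fluent" text one probable word at a time, with no
# understanding of what any of it means.
word, output = "the", ["the"]
for _ in range(12):
    word = next_word(word)
    output.append(word)
print(" ".join(output))
```

Scaled up by many orders of magnitude, this is why the output reads smoothly while carrying no model of truth behind it: every word is merely a statistically plausible continuation of the last.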

In a recent New Yorker piece, sci-fi author Ted Chiang described it as a “blurry JPEG of the internet” – it has impressively transformed massive volumes of data into an algorithmic black box, but in the process it’s sacrificed specificity, nuance and the guarantee of accuracy. Put in a prompt, and what you get back is a squinty approximation of the internet’s collective consciousness.

This, combined with the fact that ChatGPT can’t recognize the limits of its own knowledge, means the tool is an elegant bullshitter. Where it doesn’t know things, it plugs the gap with fluent and tangentially related fluff. Or worse, it “hallucinates”, producing superficially coherent flights of fancy.

Behind all of this, of course, looms the counterfactual. Perhaps other students are turning in ChatGPT essays that are so good, they’re flying under professors’ radars. There is, after all, a chorus of professors weirdly committed to broadcasting the tech’s triumphs over their puny assignments. But without substantial tinkering on the student’s part, this appears unlikely.

“Right now, I don’t think it’s good enough to write a college-level paper,” says Kristin Merrilees, a student at Barnard College. She has heard about some students using ChatGPT on worksheet exercises, where the required responses are short and relatively simple, but doesn’t know of anyone who’s attempted a full-length essay so far. Merrilees has, however, used the software as a study aid to help summarize material on a particular topic, although it “sometimes gets things wrong”.

While the model is likely to improve further, some issues remain intractable for now. AI experts note that at present, researchers aren’t sure how to make the model more factually reliable, or cognizant of the limits of its abilities. “Grounding large language models is a lofty goal and something we have barely begun to scratch the surface of,” says Swabha Swayamdipta, assistant professor of Computer Science in the USC Viterbi School of Engineering.

To make tools like ChatGPT more reliable, companies might incorporate more human reinforcement learning, but this may also “make the models tamer and more predictable, pushing them towards being blander and having a more recognizable style,” says Jaime Sevilla, director of Epoch, an AI research and forecasting firm. You can see the difference this makes by comparing ChatGPT with its zanier cousin, GPT-3, he points out.

Professors are still grappling with what, if anything, they should do about ChatGPT. But early evidence of ChatGPT-enabled cheating suggests a rough rubric for making essay prompts less gameable. Questions that ask students to describe or explain topics already covered at length online are squarely within ChatGPT’s wheelhouse. Think of something like “discuss the major themes of Hamlet”. “That’s all over the internet – it’s going to nail that,” says Hick.

If professors wish to cover these kinds of texts, they may have to get more creative with the kinds of questions they ask. Being more specific is one potential route: “ChatGPT has flawlessly read its library once and then burned the library,” says Devereaux. “It’s going to struggle to produce specific quotations unless they’re so common in the training material that they dominate the algorithm.”

Some professors say that because their assessments require critical thinking and evidence of learning, they’re beyond ChatGPT’s capabilities. “I’ve seen reports about ChatGPT passing this exam at this or that business school,” says Devereaux. “If ChatGPT can pass your exam, there’s probably something wrong with your exam.”

But one strand of the discourse holds that ChatGPT-related disruption is inevitable: educators must down tools and surrender, usurped by the boxy maw of the AI interface.

Ethan Mollick, associate professor of innovation and entrepreneurship at the Wharton School of the University of Pennsylvania, has told students he expects them to use ChatGPT to do their assignments, and that this won’t count as cheating as long as they acknowledge where it assisted. He and others have started getting students to analyze ChatGPT-produced essays in class as part of the curriculum too.

Some professors I spoke to thought that getting students to examine ChatGPT’s output could prove a creative approach to the tech. But others were concerned it would let students skip actually learning to write essays themselves, missing out on the critical reflection and analysis that is a major part of the process.

“An essay in this sense is a word-box that we put thoughts in so that we can give those thoughts to someone else,” Devereaux wrote in a blog about ChatGPT. “But ChatGPT cannot have original thoughts, it can only remix writing that is already in its training material; it can only poorly copy writing someone else has already done better somewhere else.”

For his part, Hick has threatened that any student suspected of using ChatGPT will face an on-the-spot oral test. (Bad news for students who just happen to be as bland and cocky as ChatGPT.)

Given the deluge of AI-generated bilge already flooding the internet, Devereaux says he doesn’t really understand why ChatGPT was unleashed on the world, or whether it will end up being a net positive.

“I’m a military historian, so as you might imagine, I’m very familiar with the many technologies that exist, but which we should not use,” he says. “We could conduct a live experiment on whether detonating 2,000 nuclear weapons would cause a nuclear winter. We shouldn’t.”