AI-generated exam answers go undetected in real-world test

Carl Geisler

June 28, 2024

Researchers from the University of Reading in the UK conducted a blind study to see if human educators were able to detect AI-generated content. The results don’t bode well for teachers.

The move toward remote learning has seen a lot of student assessments conducted outside the traditional space of a written exam with an invigilator on the lookout for cheating.

Ready access to advanced AI models has made it easy for students to use tools like ChatGPT to write their coursework assignments for them or help when taking online exams.

Would a smart university professor be able to tell if a student was using AI to answer online exam questions?

Associate Professor Peter Scarfe, one of the paper’s lead authors, said, “Many institutions have moved away from traditional exams to make assessment more inclusive. Our research shows it is of international importance to understand how AI will affect the integrity of educational assessments.”

In the “largest and most robust blind study of its kind,” the researchers submitted entirely AI-written answers to the examinations system in five undergraduate modules for a BSc degree in Psychology.

The markers of the exams were completely unaware of the study. This made the experiment a kind of real-world Turing test, where an educator who didn’t call ‘Cheat!’ on a paper believed a human wrote the assignment.

Here’s how it worked:

The submissions were created using GPT-4.
The researchers submitted a total of 33 AI-generated exam entries across five modules.
The study included both short answer questions (SAQs) and longer essay-based questions.
For SAQ exams, submissions consisted of four answers from a choice of six questions, each with a 200-word limit.
For essay-based exams, submissions consisted of a single 1500-word essay (students submitted one answer out of a choice of either three or four, depending on the module).

The researchers used the following prompts with ChatGPT for the SAQs and essays, respectively:

“Including references to academic literature but not a separate reference section, answer the following question in 160 words: XXX”

“Including references to academic literature but not a separate reference section, write a 2000-word essay answering the following question: XXX” (they chose 2,000 words because ChatGPT usually underdelivers on word counts)

In each prompt, XXX was replaced by the exam question.
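
To make the substitution step concrete, here is a minimal sketch of how the two quoted prompts could be filled in for a given question. The template wording comes from the prompts above; the function name `build_prompt`, the `kind` labels, and the sample question are hypothetical and not taken from the study.

```python
# Prompt templates as quoted in the article; "XXX" marks where the exam question goes.
SAQ_TEMPLATE = (
    "Including references to academic literature but not a separate "
    "reference section, answer the following question in 160 words: XXX"
)
ESSAY_TEMPLATE = (
    "Including references to academic literature but not a separate "
    "reference section, write a 2000-word essay answering the following "
    "question: XXX"
)


def build_prompt(question: str, kind: str) -> str:
    """Fill the XXX placeholder for a short-answer ("saq") or "essay" prompt.

    The function name and the kind labels are illustrative assumptions.
    """
    templates = {"saq": SAQ_TEMPLATE, "essay": ESSAY_TEMPLATE}
    if kind not in templates:
        raise ValueError(f"unknown kind: {kind!r}")
    return templates[kind].replace("XXX", question.strip())


# Hypothetical exam question, for illustration only.
print(build_prompt("What is working memory?", "saq"))
```

The sketch only builds the prompt text; how the researchers actually sent it to the model is not described here.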

When the results were tallied, 94% of the AI submissions went unflagged by the markers. What kind of grades did the AI papers achieve?

The researchers said, “The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students. Across modules there was an 83.4% chance that the AI submissions on a module would outperform a random selection of the same number of real student submissions.”
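
A figure like the 83.4% above can be read as the result of a resampling comparison: repeatedly draw as many real student marks as there were AI submissions on a module and count how often the AI set scores higher. The sketch below shows one way to do that, assuming the mean as the comparison statistic and sampling without replacement; the study's exact procedure may differ, and all marks shown are synthetic.

```python
import random
from statistics import mean


def prob_ai_outperforms(ai_marks, student_marks, n_draws=10_000, seed=0):
    """Estimate how often the AI mean beats the mean of a random,
    same-sized sample of student marks (drawn without replacement)."""
    rng = random.Random(seed)
    ai_mean = mean(ai_marks)
    k = len(ai_marks)  # must not exceed len(student_marks)
    wins = sum(
        mean(rng.sample(student_marks, k)) < ai_mean
        for _ in range(n_draws)
    )
    return wins / n_draws


# Synthetic marks for illustration only; NOT data from the study.
ai = [68, 65, 70, 62, 66, 64, 69]
students = [58, 62, 55, 71, 60, 64, 49, 67, 59,
            63, 52, 66, 61, 57, 65, 70, 54, 60]
print(prob_ai_outperforms(ai, students))
```

Fixing the seed keeps the estimate reproducible. A different statistic, such as the median, can be swapped in by replacing `mean`.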

Researchers further noted that their approach likely overestimates the detectability of AI use in real-world scenarios. As Dr. Scarfe explained, “If real students were cheating in an exam, they would be unlikely to take such a naively obvious approach as we did.”

In practice, students might use AI as a starting point, refining and personalizing the output, making detection even more challenging.

On top of that, besides the researchers’ AI submissions, other students likely used ChatGPT for their answers. This means the real detection rate could be even lower than the recorded results suggest.

No simple solutions

Couldn’t tutors simply have used AI detection software? Maybe, but not confidently, says the study.

AI detectors, like that offered by the popular academic plagiarism platform Turnitin, have been proven inaccurate.

Plus, AI detectors risk falsely accusing non-native English speakers, who are less likely to use certain vocabulary, idioms, and other features that detectors can treat as signals of human writing.

With no reliable means to detect AI-generated content, education leaders are left scratching their heads. Should the use of AI be punished, or should it simply form part of the syllabus? Should using AI be normalized like the calculator?

Overall, there’s some consensus that integrating AI into education is not without risks. At worst, it threatens to erode critical thinking and stunt the creation of authentic new knowledge.

Professor Karen Yeung cautioned against potential “deskilling” of students, telling The Guardian, “There is a real danger that the coming generation will end up effectively tethered to these machines, unable to engage in serious thinking, analysis or writing without their assistance.”

To combat AI misuse, Reading researchers recommend potentially moving away from unsupervised, take-home exams to more controlled environments. This could involve a return to traditional in-person exams or the development of new, AI-resistant assessment formats.

Another possibility – and a model some universities are already following – is developing coursework that teaches students how to use AI critically and ethically.

We also need to confront the evident lack of AI literacy among tutors exposed by this study. It seems pretty woeful.

ChatGPT often resorts to certain ‘tropes’ or sentence patterns that become quite obvious when you’re exposed to them frequently.

It would be interesting to see how a tutor ‘trained’ to recognize AI writing would perform under the same conditions.

ChatGPT’s exam record is mixed

The Reading University study is not the first to test AI’s capabilities in academic settings. Various studies have examined AI performance across different fields and levels of education:

Medical exams: A group of pediatric doctors tested ChatGPT (GPT-3.5) on the neonatal-perinatal board exam. The AI scored only 46% correct answers, performing best on basic recall and clinical reasoning questions but struggling with multi-logic reasoning. Interestingly, it scored highest (78.5%) in the ethics section.
Financial exams: JPMorgan Chase & Co. researchers tested ChatGPT and GPT-4 on the Chartered Financial Analyst (CFA) exam. While ChatGPT was unlikely to pass Levels I and II, GPT-4 showed “a decent chance” if prompted appropriately. The AI models performed well in derivatives, alternative investments, and ethics sections but struggled with portfolio management and economics.
Law exams: ChatGPT has been tested on the bar exam, often scoring very highly.
Standardized tests: The AI has performed well on Graduate Record Examinations (GRE), SAT Reading and Writing, and Advanced Placement exams.
University courses: Another study pitted ChatGPT (model not given) against 32 degree-level topics, finding that it matched or exceeded students on only 9 out of 32 exams.

So, while AI excels in some areas, this is highly variable depending on the subject and type of test in question.

The conclusion is that if you’re a student who doesn’t mind cheating, you can use ChatGPT to get better grades with only a 6% chance of getting caught. You’ve got to love those odds.

As researchers noted, student assessment methods will have to change to maintain their academic integrity, especially as AI-generated content becomes harder to detect.

The researchers added a humorous conclusion to their paper.

“If we were to say that GPT-4 had designed part of this study, did part of the analysis and helped write the manuscript, other than those sections where we have directedly quoted GPT-4, which parts of the manuscript would you identify as written by GPT-4 rather than the authors listed?”

If the researchers “cheated” by using AI to write the study, how would you prove it?

Carl Geisler

Carl is an online marketer and content creator with a passion for artificial intelligence and innovative technology. He is one of the founders of KI-Techlab.de and writes here about new AI tools and innovations.
