Daphne Ippolito, a senior research scientist at Google specializing in natural-language generation, who also did not work on the project, raises another concern.
“If automatic detection systems are to be employed in education settings, it is crucial to understand their rates of false positives, as incorrectly accusing a student of cheating can have dire consequences for their academic career,” she says. “The false-negative rate is also important, because if too many AI-generated texts pass as human written, the detection system is not useful.”
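To make the two error rates Ippolito describes concrete, here is a minimal Python sketch of how they could be computed for any detector. The function name `detection_error_rates` and the sample data are hypothetical, invented purely for illustration; they are not drawn from the study or from any of the tools it tested.

```python
def detection_error_rates(is_ai_written, flagged_as_ai):
    """Return (false_positive_rate, false_negative_rate) for a detector.

    is_ai_written: ground-truth booleans, True if the text was AI-generated.
    flagged_as_ai: detector verdicts, True if the detector flagged the text.
    """
    pairs = list(zip(is_ai_written, flagged_as_ai))
    human_flags = [flagged for truth, flagged in pairs if not truth]
    ai_flags = [flagged for truth, flagged in pairs if truth]
    if not human_flags or not ai_flags:
        raise ValueError("need at least one human-written and one AI-written sample")

    # False positive: human-written text wrongly flagged as AI-generated.
    false_positive_rate = sum(human_flags) / len(human_flags)
    # False negative: AI-generated text that passes as human-written.
    false_negative_rate = sum(not flagged for flagged in ai_flags) / len(ai_flags)
    return false_positive_rate, false_negative_rate


# Illustrative, made-up data: four human-written and four AI-written texts.
is_ai_written = [False, False, False, False, True, True, True, True]
flagged_as_ai = [False, False, False, True, True, False, False, False]

fpr, fnr = detection_error_rates(is_ai_written, flagged_as_ai)
print(f"False-positive rate: {fpr:.0%}")
print(f"False-negative rate: {fnr:.0%}")
```

In this invented sample, one of the four human-written texts is wrongly flagged and three of the four AI-written texts slip through, which is the combination Ippolito warns about: students at risk of false accusation and a detector that misses most of what it is meant to catch.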
Compilatio, which makes one of the tools tested by the researchers, says it is important to remember that its system just indicates suspect passages, which it classifies as potential plagiarism or content potentially generated by AI.
“It is up to the schools and teachers who mark the documents analyzed to validate or impute the knowledge actually acquired by the author of the document, for example by putting in place additional means of investigation—oral questioning, additional questions in a controlled classroom environment, etc.,” a Compilatio spokesperson said.
“In this way, Compilatio tools are part of a genuine teaching approach that encourages learning about good research, writing, and citation practices. Compilatio software is a correction aid, not a corrector,” the spokesperson added. Turnitin and GPTZero did not immediately respond to a request for comment.
We’ve known for some time that tools meant to detect AI-written text don’t always work the way they’re supposed to. Earlier this year, OpenAI unveiled a tool designed to detect text produced by ChatGPT, admitting that it flagged only 26% of AI-written text as “likely AI-written.” OpenAI pointed MIT Technology Review towards a section on its website for educator considerations, which warns that tools designed to detect AI-generated content are “far from foolproof.”
However, such failures haven’t stopped companies from rushing out products that promise to do the job, says Tom Goldstein, an assistant professor at the University of Maryland, who was not involved in the research.
“Many of them are not highly accurate, but they are not all a complete disaster either,” he adds, pointing out that Turnitin managed to achieve some detection accuracy with a fairly low false-positive rate.
And while studies that shine a light on the shortcomings of so-called AI-text detection systems are very important, it would have been helpful to expand the study’s remit to AI tools beyond ChatGPT, says Sasha Luccioni, a researcher at AI startup Hugging Face.
For Kovanović, the whole idea of trying to spot AI-written text is flawed.
“Don’t try to detect AI—make it so that the use of AI is not the problem,” he says.