In Short:
Researchers have developed a new method to estimate how often large language models (LLMs) are used in scientific writing. By analyzing excess word usage in abstracts from 2023 and 2024, they estimate that at least 10 percent of 2024 abstracts were processed with LLMs. The study compared word frequencies before and after LLMs became widely available and found a sharp increase in style words, mostly verbs and adjectives. The same approach can help gauge how widespread LLM-assisted writing has become.
Researchers Develop Method to Detect LLM Usage in Scientific Writing
AI companies have struggled to detect when a piece of writing was generated using a large language model (LLM). However, a group of researchers has created a new method to estimate LLM usage in scientific writing by identifying “excess words” that appeared more frequently during the LLM era in 2023 and 2024. According to the researchers, at least 10 percent of abstracts in 2024 were processed with LLMs.
Insights from the Study
In a recent preprint paper, researchers from Germany’s University of Tübingen and Northwestern University took inspiration from studies that measured the impact of the Covid-19 pandemic through excess deaths. Applying the same idea to text, they analyzed excess word usage after LLM writing tools became widely available in late 2022. They found an increase in the frequency of certain style words that they describe as unprecedented in both quality and quantity.
Delving Into the Data
The researchers examined 14 million paper abstracts published on PubMed between 2010 and 2024 to track changes in word frequency. They found that certain words, such as “delves,” “showcasing,” and “underscores,” became significantly more frequent after LLM writing tools became widely available in late 2022.
The researchers also observed a surge in already common words like “potential,” “findings,” and “crucial” in post-LLM abstracts. Unlike earlier shifts in vocabulary, which tracked major world health events, these changes involved style words rather than words tied to a particular topic.
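To make the idea of excess word usage concrete, here is a minimal Python sketch. It is not the researchers' code: the function names, the tokenizer, and the toy abstracts are all assumptions made for illustration, and it uses a plain before-and-after comparison where a real analysis would need a carefully projected baseline. For each word it computes the share of abstracts containing that word in each period, then reports the gap and the ratio between the two.

```python
import re
from collections import Counter

def doc_frequency(abstracts):
    """Fraction of abstracts that contain each word at least once."""
    counts = Counter()
    for text in abstracts:
        # Hypothetical tokenizer: lowercase alphabetic runs, counted once per abstract.
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    n = len(abstracts)
    return {word: c / n for word, c in counts.items()}

def excess_usage(before, after):
    """Per-word (gap, ratio) of the 'after' frequency relative to the 'before' baseline."""
    p_before = doc_frequency(before)
    p_after = doc_frequency(after)
    floor = 1 / (len(before) + 1)  # assumed floor so unseen words give a finite ratio
    result = {}
    for word, p in p_after.items():
        q = p_before.get(word, 0.0)
        result[word] = (p - q, p / max(q, floor))
    return result

# Toy data, invented for illustration only.
before = [
    "we study the effect of x",
    "results show the effect",
    "we measure y in mice",
    "the data show a trend",
]
after = [
    "this study delves into x",
    "we study the effect",
    "our work delves into crucial findings",
    "the data show a trend",
]

excess = excess_usage(before, after)
for word in sorted(excess, key=lambda w: excess[w][0], reverse=True)[:3]:
    gap, ratio = excess[word]
    print(f"{word}: gap={gap:.2f}, ratio={ratio:.2f}")
```

Reporting both numbers is a deliberate choice in this sketch: the ratio highlights rare words that suddenly appear, while the gap highlights common words that become somewhat more common.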
An Intricate Interplay
The study highlighted hundreds of “marker words” that became more common in post-LLM scientific writing. By analyzing these markers, the researchers estimated that at least 10 percent of post-2022 abstracts in the PubMed corpus were processed with LLM assistance. The actual share could be higher, because some LLM-processed abstracts may contain none of the identified marker words.
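The reasoning behind the "at least" can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the study's actual procedure: the helper names and toy abstracts are hypothetical, and the marker set contains only the three words quoted above. The sketch measures the share of abstracts with at least one marker word in each period and treats the increase as a floor on LLM involvement, since LLM-processed abstracts without any marker word go uncounted.

```python
import re

# Marker words quoted in the article; the study itself lists hundreds more.
MARKERS = {"delves", "showcasing", "underscores"}

def share_with_markers(abstracts, markers):
    """Fraction of abstracts containing at least one marker word."""
    hits = 0
    for text in abstracts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & markers:
            hits += 1
    return hits / len(abstracts)

def lower_bound_llm_share(before, after, markers):
    """Excess share of marker-containing abstracts, floored at zero."""
    gap = share_with_markers(after, markers) - share_with_markers(before, markers)
    return max(0.0, gap)

# Toy data, invented for illustration only.
before = ["this result underscores the need for care"] + ["we measure y in mice"] * 9
after = [
    "this study delves into x",
    "a figure showcasing the effect",
    "this underscores the trend",
] + ["we measure y in mice"] * 7

bound = lower_bound_llm_share(before, after, MARKERS)
print(f"estimated lower bound: {bound:.0%}")
```

Subtracting the earlier share matters because these words were already in occasional use before LLMs, so only the increase is attributed to them.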