Starting this summer, submissions to the arXiv, the online server that many physicists check daily for new preprints, will be compared with the server's existing manuscripts, 400 000 and counting, to check for plagiarism.

When plagiarism is suspected, the submission will be flagged, and the authors will get a message saying “your article has x% overlap with article ‘a.’ Do you really want to do this?” says Cornell University physicist Paul Ginsparg, the creator and overseer of the arXiv. The authors whose papers were copied from will not be notified.

“This will be a fun experiment,” Ginsparg says. “Will we train people to be more clever and to make more word changes? Or will there be a real change in their behavior?”

Behavior did change when University of Virginia physicist Louis Bloomfield began using software to see if his students were cheating. Checking new arXiv submissions is a good idea, Bloomfield says. “People should know it's not okay to steal. It's not even okay to publish your own stuff over and over.” After he reported students who had copied, they were prosecuted. Forty-five students either left the university or were found guilty, and three degrees were revoked. “I was immersed in seemingly endless honor trials. Two years of my life were burned up. There's a lot of trouble when you open this can of worms. Plagiarism shouldn't be tolerated, but you need a professional organization to handle the heat.”

Warning: Excessive Overlap Detected

An automated processor has identified excessive text overlap between your attempted submission [pending/0702.1001] and an earlier submission [physics/9403001] by different author(s).

You are free to override this warning, but you may wish to reconsider whether you really want to submit this article in its current form.

For further information on accepted authoring practices, see the American Physical Society ethical guidelines for authors.

Overview of next-generation double beta decay experiments

Earlier submission: physics/9403001
Authors: B. Original and A. Coauthor
Matching blocks include 4768 of the document's 48184 words (9.9%)
Longest contiguous matching block is 109 words
…
In lieu of the traditional confrontation between theory and experiment, superstring theorists pursue an inner harmony where elegance, uniqueness and beauty define truth.
…
Is further experimental endeavor not only difficult and expensive but unnecessary and irrelevant? Contemplation of superstrings may evolve into an activity as remote from conventional particle physics as particle physics is from chemistry, to be conducted at schools of divinity by future equivalents of medieval theologians. For the first time since the Dark Ages, we can see how our noble search may end, with faith replacing science once again. Superstring sentiments eerily recall “arguments from design” for the existence of a supreme being. Was it only in jest that a leading string theorist suggested that “superstrings may prove as successful as God, Who has after all lasted for millennia and is still invoked in some quarters as a Theory of Nature”? … might be the sort of thing that Wolfgang Pauli would have said is “not even wrong.” Not even a politically popular “Superstring Detection Initiative” with a catchy name like “String Wars”…

Attempted submission: pending/0702.1001
Author: I.M. Slapdash
Matching blocks include 4768 of the document's 10720 words (44.5%)
Longest contiguous matching block is 109 words
…
Instead of the traditional confrontation between theory and experiment, superstring theorists pursue an inner harmony of truth and beauty.
…
Is further experiment not only difficult and expensive but unnecessary and irrelevant? Contemplation of superstrings may evolve into an activity as remote from conventional particle physics as particle physics is from chemistry, to be conducted at schools of divinity by future equivalents of medieval theologians. Superstring proponents eerily recall “arguments from design” for the existence of a supreme being. Was it only in jest that a leading string theorist suggested that “superstrings may prove as successful as the geocentric universe, which has after all lasted for millennia and is still invoked in some quarters as a Theory of the Universe”, but is not even wrong?…

A warning will be sent to authors who submit a document that overlaps with other material in the arXiv. This mockup was constructed by Paul Ginsparg from an article he and Sheldon Glashow wrote for Physics Today (May 1986, page 7).

The arXiv's automated scanning for overlapping text is a refinement of an algorithm used last year by Cornell computer science graduate student Daria Sorokina to look at the server's then nearly 300 000 documents. The algorithm assigns unique numbers to word sequences and then compares those numbers across documents. Common phrases such as “this work was supported in part by” are excluded. “There is nothing new about document fingerprinting,” says Cornell computer scientist Johannes Gehrke, an adviser on the project. “The novelty here was the application to the arXiv.”
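The fingerprinting idea described above can be illustrated with a minimal Python sketch. It is not the arXiv's actual code: the seven-word window, the hash function, the one-entry list of common phrases, and the names `words`, `fingerprints`, and `overlap_percent` are all assumptions chosen for illustration. Each run of consecutive words is mapped to a number, and the numbers from one document are looked up in the set built from another.

```python
import hashlib
import re

# Assumptions for illustration only: neither the window length nor this
# phrase list is reported for the arXiv system.
WINDOW = 7
COMMON_PHRASES = ["this work was supported in part by"]


def words(text):
    """Lowercase, normalize whitespace, drop common phrases, split into words."""
    text = " ".join(text.lower().split())
    for phrase in COMMON_PHRASES:
        # Crude removal: the words on either side of the phrase become adjacent.
        text = text.replace(phrase, " ")
    return re.findall(r"[a-z0-9']+", text)


def fingerprints(tokens, window=WINDOW):
    """Map each run of `window` consecutive words to a 64-bit number."""
    prints = []
    for i in range(len(tokens) - window + 1):
        chunk = " ".join(tokens[i:i + window]).encode("utf-8")
        digest = hashlib.blake2b(chunk, digest_size=8).digest()
        prints.append(int.from_bytes(digest, "big"))
    return prints


def overlap_percent(doc, other, window=WINDOW):
    """Percentage of words in `doc` inside a window that also occurs in `other`."""
    tokens = words(doc)
    if not tokens:
        return 0.0
    other_prints = set(fingerprints(words(other), window))
    covered = [False] * len(tokens)
    for i, fp in enumerate(fingerprints(tokens, window)):
        if fp in other_prints:
            for j in range(i, i + window):
                covered[j] = True
    return 100.0 * sum(covered) / len(tokens)
```

Calling `overlap_percent(new_text, old_text)` gives a figure in the spirit of the “x% overlap” message quoted earlier. Because only exact word runs are hashed, changing a single word inside a window defeats the match for that window, which is the limitation Ginsparg alludes to when he asks whether authors will simply “make more word changes.”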

In the study, about 10% of arXiv manuscripts had text blocks that overlapped with other documents. After removing instances of authors reusing parts of their own text, different collaborators on a single project using the same text in separate conference abstracts, and other apparent false positives, less than 1% of manuscripts were still suspect, says Sorokina.

Close examination of 20 pairs of documents with some of the highest levels of overlap exposed 16 as plagiarism. “In one case, an author copied descriptions of five or six methods that he was comparing,” says Sorokina. “He didn't cite the sources. But the work of comparing was his own.” One of the most common types of plagiarism found was the lifting of introductory or background material, especially in PhD theses, says Ginsparg. “The surprising thing is that people submit to the same database where they found [what they copied]. It's mind-boggling, given the existence of Google, given the existence of searching on full text, that people wouldn't have an intuition that they would be caught.”

“Some of it is different ethical norms,” Ginsparg adds. “People in different countries, with different intellectual backgrounds, will sometimes argue that what they are doing is completely correct.” The reassuring thing, he adds, “is that the most creative people, who are generating the ideas, don't have to start from someone else's article as a template. We'd be very surprised if authors of prominence showed up as perpetrators as opposed to victims.”

Document fingerprinting catches only word-for-word plagiarism. But work is under way in the data-mining community on author identification and detection of the flow of ideas, says Gehrke. “Detecting content-based similarities with more sophisticated methods on a macroscale will be the next step.”

In addition to implementing a check on new submissions to the arXiv, Ginsparg is talking to the editors of Physical Review Letters about applying the method to it and other American Physical Society publications. “More work needs to be done to include papers outside of the arXiv, and to go across journals,” says Marty Blume, the recently retired APS editor-in-chief. “We have 30 000 submissions a year. We'll have to see how much [of the editors'] time it takes to run. And if we do it, what do we do with the results?”