Keeping Content Honest with Plagiarism Detection

The growth of the Web has considerably lowered the barrier to gathering and sharing information. It has also made it far simpler for students and Websites to copy text, deliberately or accidentally, without proper citation or licensing. To address these concerns, a wide variety of so-called plagiarism detection tools have emerged to help match new content against existing bodies of text. While they are not perfect, they promise to help keep students and contributors honest, and to help businesses identify and rectify infringements of their work.

Leading plagiarism detection tools in the academic space include Ephorus, PlagScan, Turnitin, and URKUND. In the business space, leading tools for detecting and preventing plagiarism include Copyscape, Distil Networks, PlagScan, and iThenticate.


“Plagiarism detection tools are really more matching text detection tools; they don’t really detect plagiarism, just identical text,” said Jonathan Bailey, CEO and Consultant at CopyByte, a plagiarism consultancy service. “That’s very useful and very impressive, but a human has to make the determination if a work is plagiarized or not.”

The problems around plagiarism have been gathering significant publicity over the last couple of years. John Walsh, a senator from Montana, dropped out of his re-election campaign after he was accused of plagiarizing his master’s thesis. In Germany, a number of politicians and government leaders, including the Minister of Education, were forced to step down after they were accused of plagiarism. Many of these cases were identified by a manual review process organized through Wikis such as GuttenPlag and VroniPlag.

The first plagiarism scanning tools emerged in late 1997 with Turnitin, followed by Copyscape in 2004. “As more and more users got online, people decided it was a good idea to check for plagiarism,” said Markus Goldbach, CEO of PlagScan. “Now there are over 45 different services doing some version of plagiarism detection. This was just a logical consequence of the growth of Internet-published content and data sharing.” The business value of plagiarism detection services is starting to draw notice from big investors. In 2014, iParadigms, the company behind Turnitin, was bought by Insight Venture Partners for $752 million. The site is used by over 25 million students worldwide.

“The technology has come an amazingly long way over the past ten years or so,” said Bailey. “We have both high-end tools, like Turnitin/iThenticate, that offer access to specialized databases and tools like translated text matching, and we have less expensive solutions like Copyscape and even Google itself that offer cheap, effective detection at a very low cost. The improvement of plagiarism detection technology has been parallel to the improvement of search engine technology, as the two use fundamentally the same principle: find matching text in a large database of content. Because of that, we’ve come a very long way.”
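The principle Bailey describes, finding matching text in a large database of content, can be illustrated with a minimal sketch. It assumes a tiny in-memory corpus and overlapping word sequences ("shingles") as the unit of comparison. The function names and the sample corpus are hypothetical, and nothing here reflects how any of the named commercial tools are actually built.

```python
from collections import defaultdict


def shingles(text, n=4):
    """Return the set of lowercase n-word sequences (shingles) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(corpus, n=4):
    """Map every shingle to the ids of the corpus documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for shingle in shingles(text, n):
            index[shingle].add(doc_id)
    return index


def find_matches(query, index, n=4):
    """Count how many shingles of the query occur in each indexed document."""
    hits = defaultdict(int)
    for shingle in shingles(query, n):
        for doc_id in index.get(shingle, ()):
            hits[doc_id] += 1
    return dict(hits)


# Hypothetical two-document "database" used only for illustration.
corpus = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "plagiarism detection tools match new content against existing text",
}
index = build_index(corpus)
result = find_matches("we saw the quick brown fox jumps over a fence", index)
print(result)
```

A real service would also normalize punctuation, store compact hashes rather than raw shingles, and search a vastly larger index. In every case the output is a list of overlaps for a person to review, not a verdict of plagiarism.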

Different Approaches Required

Blog owners and academic institutions typically leverage plagiarism detection tools in different ways and for different goals:

Academic

Students check their own work to make sure they comply with citation rules.
Teachers can scan student papers to identify plagiarism.

Business

Businesses can use plagiarism detection tools as a quality management tool to ensure the uniqueness of content and address legal copyright concerns.
After something has been published, businesses can also use plagiarism detection tools to ensure that their content stays unique. These tools make it easier to measure and document abuse of their works by competitors.

As Mikkel Preisler, CEO at Seofabrikken.dk noted, “My company sells its product by promising unique content. By attaching PlagScan reports I guarantee this. I weekly catch some SEO-copywriters inadvertently making internal duplicate content within their articles.”

Book publishers are also starting to leverage plagiarism detection tools to ensure the quality of textbooks, which can be very costly to recall if plagiarism is found after publication. “Recalling a single title can set a publisher back hundreds of thousands, even millions of dollars,” said PlagScan’s Goldbach.

Goldbach also sees a role for similar technology in improving workflows in businesses and the legal industry, which need to cross-reference multiple contracts, laws, and regulations for compliance purposes. In these cases, the goal is to find changes in content from year to year rather than similarities. For example, a lawyer in the fire protection industry typically has to manually compare legal documents of 100 pages or more to find the one line that matters, such as a new lining requirement for hot water heaters. An automated approach based on plagiarism scanning techniques can reduce the time required to identify and document these changes by 25-30%, said Goldbach.
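As a rough illustration of looking for changes rather than similarities, the sketch below compares two versions of a document line by line using Python's standard difflib module. The regulation text is invented for the example, and this is only one simple way to surface differences, not a description of PlagScan's approach.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return lines removed from ('- ') or added to ('+ ') the old version."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical excerpts from two yearly editions of a regulation.
old_version = """Section 1: Scope of this regulation
Section 2: Water heaters must be inspected every year.
Section 3: Penalties for non-compliance"""

new_version = """Section 1: Scope of this regulation
Section 2: Water heaters must be fitted with an approved lining.
Section 3: Penalties for non-compliance"""

changes = changed_lines(old_version, new_version)
for line in changes:
    print(line)
```

Only the edited Section 2 line is reported, once as removed and once as added, so a reviewer can skip the unchanged sections.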

Getting to the Source

A key component of plagiarism detection technology is the source material needed to establish that plagiarism has occurred. Content lifted from the Internet is relatively easy to find using popular search engines and their associated APIs. Other types of content may be more difficult to cross-reference.

Content from private academic papers requires some kind of collaborative database, such as the one leveraged by Turnitin, which allows teachers and students to cross-reference new papers against ones that have already been submitted. Academic and scientific research papers are often locked behind expensive paywalls and poor APIs, which makes it difficult to cross-reference new papers against them. Copyrighted academic book content is likewise not easily available through basic Internet search engines.

“The technology can be great, but if it does not have access to the sources, then you have no chance of finding plagiarism,” said Goldbach. “It is about crawling the Internet on the one hand, and getting agreements from publishers to access their data. You also might need to collect student data, or ask Universities to provide data regarding exams or papers people have submitted from previous years.”

Making the Complex Simple

Right now the biggest challenges for assessing plagiarism involve getting people to actually search for it regularly, said CopyByte’s Bailey. “Whether it’s teachers, editors or other supervisors, most of the time plagiarism slips through the cracks. It’s not because the tools were inadequate, it’s because the people in charge didn’t search for it early and often enough.”

Another big challenge in simplifying the workflow lies in managing the findings so that potential infringement is easier to identify. A Google search might turn up thousands of documents, but there needs to be a way to efficiently narrow these down to the documents worth comparing in detail. This requires semantic analysis of the text to identify similar passages, and a user interface that presents them to users as part of their workflow.

Another challenge lies in presenting results without overwhelming users with false positives. For example, many words and phrases such as “in my opinion” are quite common, and no one would ever accuse an author of plagiarism for using them. Also, if something is quoted with proper citation, a tool might highlight it in a different color so that a reviewer can see that the content was properly cited in the new article.
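A minimal sketch of this kind of triage might label each matched phrase before it is shown to a reviewer. The list of common phrases, the labels, and the function name below are assumptions made for illustration. The check only looks for quotation marks, so a real tool would still need to verify that a citation accompanies the quote.

```python
import re

# Hypothetical list of stock phrases that should never be flagged on their own.
COMMON_PHRASES = {"in my opinion", "on the other hand", "as a result"}

# Passages enclosed in straight or curly double quotation marks.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def classify_match(document, phrase):
    """Label a matched phrase as 'common', 'quoted', or 'unattributed'."""
    if phrase.lower() in COMMON_PHRASES:
        return "common"
    for quoted_passage in QUOTED.finditer(document):
        if phrase in quoted_passage.group(0):
            return "quoted"
    return "unattributed"


document = (
    'In my opinion, the author is right that "search engines reward '
    'original content" and sites that copy text are ranked lower.'
)
for phrase in (
    "In my opinion",
    "search engines reward original content",
    "sites that copy text are ranked lower",
):
    print(phrase, "->", classify_match(document, phrase))
```

An interface could then suppress the "common" matches, show "quoted" matches in one color, and reserve the most prominent color for "unattributed" ones.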

“Users don’t want to waste their time with tedious reports,” said Goldbach. One approach used by PlagScan is to highlight results directly within the document, using different colors that let a user identify matched content and then quickly see links to other sources and the specific passages that bear a similarity. Deseret Digital Media has been using PlagScan as part of its content aggregation service, which is used by some Gannett publications, Blaze.com, and other publishers to collect content from thousands of individual contributors.

Jacob Hancock, Software Project Manager at Deseret, said the challenge is to make sure the contributed content is original. They have not run into many cases of outright plagiarism, but have had instances where contributors were not sure what was ethical or right about taking and correctly citing information from other sources. He said, “There are 100 shades of grey when it comes to plagiarism. The worst thing you could do is to hurt your reputation.”

A Hard Problem to Solve

Debora Weber-Wulff, professor for Media and Computing at the HTW Berlin, said that it is important to clarify that these tools are not technically “plagiarism detection tools,” as plagiarism is a varied and multi-faceted thing. The current tools are merely “text parallel identification systems.” She said they only identify plagiarism that very stupid students submit (or simplistic bots produce) by copying and pasting large blocks of text from sources that are easily found or already in the tool’s database. If the software does not find anything, that doesn’t mean the text is plagiarism-free. It just means that nothing was found.

“Students are becoming ingenious in ways to swap words, change word order, etc, in an effort to fool the systems,” said Weber-Wulff. “But it is still plagiarism, as there is nothing of one’s own there. In science and the humanities we must strive for honesty about who contributed what and who wrote what. The software can perhaps be useful in finding duplicate text that a bot has dumped on an SEO link farm, but only if it has crawled a good portion of the Internet.”

She adds that these tools don’t identify all plagiarism, and probably never will.

“The software is a tool: you can try and drive that nail in with a screwdriver and maybe it works. If it doesn’t, that doesn’t mean that it is impossible to drive the nail in.”

PlagScan’s Goldbach agrees with Weber-Wulff that plagiarism detection tools will never reach a level where they beat a human. But he adds, “When you want to fixate a screw, you can always do this with a screwdriver and your hands. However, if you assemble a closet and there are dozens of screws to attach, a portable electric drill might come in handy. For me a plagiarism checker is the portable electric drill. You need the human to hold the tool and to navigate it. But it certainly improves your productivity in getting the job done.”

Plagiarism Versus Copyright Infringement

It is important to note that while outright plagiarism in the academic world is grounds for sanctions under institutional rules, plagiarism is not technically illegal in and of itself in a business context, even though it may be unethical. In the commercial realm, the legal landscape is defined by copyright law, which can sometimes have far-reaching interpretations. For example, a European court found in one case that 11-word snippets could be considered copyright infringement.

Weber-Wulff said, “In Germany there is no legal definition of plagiarism. It is purely an academic term. In the USA case law there are cases that deal with plagiarism, but part of each case is defining what is exactly meant by plagiarism.”

Plagiarism might fall under fraud statutes as well, said CopyByte’s Bailey. “You can defraud consumers or, in the case of researchers taking advantage of government grants, defraud the government, potentially out of millions.”

More complex cases, in which bots lift large chunks of content that are then rewritten and published on other Websites, are a murky area at the moment, noted Bailey. He explained,

“Under copyright law, a derivative work is a work that is based upon an original. Scrambling content on a site may defeat plagiarism checkers (and create gibberish) but can also be infringing in some cases. Defeating plagiarism detection tools does not mean you haven’t committed copyright infringement as elements beyond the text itself can be copyrighted. In short, if it can be shown that the new work is based upon the original and that it used enough of the work to be an infringement, it likely still is.”

However, plagiarism can still hurt Websites even if no copyright violations have occurred. Goldbach notes that leading search engines like Google often reward sites with original content. Those that are similar to others receive lower rankings.

Since plagiarism detection and citation analysis is a rather complex field, it could be another ten years before it is done well, said Goldbach. There are differences in what needs to be highlighted in different academic disciplines, such as math and theology, and in the needs of businesses. In the end, people are likely to think more in terms of semantic analysis for a wide range of functions, including plagiarism detection, opinion analysis, and understanding the impact of different media campaigns. “There will be solutions for each and every application,” he said.

George Lawton has been infinitely fascinated yet scared about the rise of cybernetic consciousness, which he has been covering for the last twenty years for publications like IEEE Computer, Wired, and many others. He keeps wondering if there is a way all this crazy technology can bring us closer together rather than eat us. Before that, he herded cattle in Australia, sailed a Chinese junk to Antarctica, and helped build Biosphere II. You can follow him on the Web and on Twitter @glawton.

There is 1 comment

October 17, 2014

JoepJ

Interesting article!

Did you notice the new development in the plagiarism scanning market? iParadigms acquired Ephorus a few weeks ago. They are quite big now… http://turnitin.com/en_us/about-us/press/iparadigms-acquires-ephorus

FYI: The Ephorus check is also available for students via this link: https://www.scribbr.com/ephorus-plagiarism-check/