College Student Journal, Sept. 2005

Using the Google search engine to detect word-for-word plagiarism in master's theses: a preliminary study

Mark McCullough and Melissa Holmberg

The effectiveness and efficiency of the Google search engine for detecting potential occurrences of word-for-word plagiarism in master's theses were investigated. 210 electronic master's theses from a sample of 260 completed in 2003 were examined. Undocumented phrases from each thesis were searched against the World Wide Web using the Google search engine. Exact phrases from each thesis were searched for up to 10 minutes. Matches, or potential occurrences of plagiarism, were found in 27.14% of the theses searched. Matches were found on or before the first numbered page in 16 of the 57 theses containing suspect passages. The average time for finding a match was 3.8 minutes. The results show that the Google search engine can be used to effectively and efficiently detect potential occurrences of plagiarism in some master's theses. The method described in the study could be used by thesis advisors and other faculty as an alternative to anti-plagiarism software packages. Further investigation is needed to determine whether Google's effectiveness is consistent across varied academic disciplines. Comparative studies of Google and anti-plagiarism software and services are needed as well.

**********

The purpose of this research was to explore Google's potential for detecting occurrences of word-for-word (1) plagiarism in master's theses. The authors sought answers to these questions:

1. Is Google an effective tool for detecting plagiarism in master's theses?

2. Is Google an efficient tool for detecting plagiarism in master's theses?

The first question relates to the nature of graduate research and the types of resources on the World Wide Web. Graduate level research in most academic disciplines requires extensive use of professional journals and monographs. Some of these materials are not available in electronic formats; those that are distributed electronically are often subscription-based and not freely available on the World Wide Web. Hence, it was unknown whether Google searches would retrieve sources of plagiarized material in master's theses (this research was conducted prior to the release of Google Scholar in November 2004). The second question stems from the authors' interest in determining whether Google might provide a relatively fast mechanism (10 minutes or less) for thesis advisors interested in checking suspect passages of a thesis draft.

The process of using search engines and periodical databases to detect plagiarism in student papers has been described by others (e.g. Ryan, 2000; Lathrop & Foss, 2000; Marshall, 1998). However, most published material on this topic is anecdotal and focuses on plagiarism in high school and undergraduate student papers. A literature search produced no studies on the effectiveness of Google or other search engines for detecting plagiarism in master's theses.

Some universities are investing in anti-plagiarism software and services such as Turn-It-In to combat academic dishonesty. Plagiarism detection services typically require student papers to be submitted to professors in electronic format. Professors then submit the papers to the software company, which checks each paper against its own database of online resources. The professor then receives reports from the company detailing which papers appear to contain plagiarism. While plagiarism detection software and services offer many benefits, they are not free. Moreover, some institutions are reluctant to use plagiarism detection software and services due to concerns about students' intellectual property and privacy rights--particularly since some companies add the content of submitted papers to their database. This practice raises concerns, even though companies such as Turn-It-In pledge to protect the content of submitted papers and do not make it available to customers (http://www.turnitin.com/static/legal/legal_document.html).

The consequences of plagiarism for students and institutions, the increased availability of graduate theses, and the need for alternatives to commercial plagiarism detection software prompted this investigation. A mechanism for detecting plagiarism in thesis drafts, prior to their final submission, would be beneficial to thesis advisors and other graduate faculty. More importantly, it could be part of the process of educating graduate students about plagiarism before their theses are submitted. The aim of this project was to determine whether the Google search engine is an effective and efficient tool for detecting plagiarism in master's theses. We selected Google since it is currently the largest search engine available (http://searchengineshowdown.com).

Method

We considered several different approaches to the project, each of which presented various problems. How should we define an occurrence of plagiarism? What constitutes a potential occurrence of plagiarism (7 consecutive words, one sentence, two sentences)? How, if at all, should we account for varying academic disciplines within the sample set? What portions of the theses should we search? How much time should we spend searching? Initially, we considered checking every sentence of equivalent portions (e.g. introductions) of every thesis against Google, but determined this would be too time-consuming. Furthermore, we viewed several theses and determined that their content and organization varied greatly between disciplines and institutions. We also initially thought it was important to define an occurrence of plagiarism by establishing criteria for phrase lengths or number of sentences, but decided that such an approach was too restrictive. We did not want our method to exclude possible occurrences of plagiarism simply because of minor alterations, such as the substitution of an acronym for a noun. Nor did we want an approach that erroneously flagged phrases, as might have happened if we selected a rigid method based on phrase length or number of sentences. Ultimately we settled on a more flexible approach. We decided to establish a time limit for searching each thesis since the efficiency of Google was one of the two questions we sought to answer. Instead of defining plagiarism by the extensiveness of copied text, we decided to approach the searching as an actual thesis advisor or committee member might: selecting any suspect phrase, from any section of the thesis.

There were several approaches we could have taken with regard to our sample. We wanted to test Google on theses from a variety of disciplines and institutions. However, we wanted to avoid the expense of ordering dozens of theses through our university's interlibrary loan service. We decided to select only electronic theses available for viewing on the web. We knew this approach would significantly limit the number of institutions in our sample, but we hoped it would still be representative enough to be meaningful. We understood that checking for plagiarism in electronic theses, at least in cases where the thesis files allowed copying and pasting, would allow for faster checking than text-based theses. We also wondered whether institutions that mounted student theses on the web might, because of the wide dissemination, screen theses for plagiarism more aggressively than other institutions, thereby affecting the results. We believed that by using electronic theses we would encounter more instances where the thesis was plagiarized by others. However, we believed this could be minimized by selecting recently submitted theses.

In May 2004 we searched the OCLC WorldCat database for English-language, web-accessible master's theses completed during 2003. This search retrieved 2,600 bibliographic records. We then obtained a random sample from the full dataset, representing 10% or 260 bibliographic records. The list of bibliographic records in the sample set included the URLs for the theses. We retrieved each thesis using either Netscape or Internet Explorer. We then opened a second browser window to the Google search engine. We scanned each thesis for suspect phrases and attempted to find matches by either copying and pasting or retyping these phrases in Google. We placed quotation marks around the phrases so they would be searched as a phrase and not as separate words in Google. We allotted 10 minutes for reviewing each thesis for possible occurrences of plagiarism. We selected phrases from various parts of the theses. We did not search for quoted material or phrases related to the writing of the thesis (e.g. "In chapter two ..."). We permitted reasonable variations when searching in Google, such as spelling out acronyms or changing verb tenses. When a Google search returned a potential occurrence of plagiarism, we recorded the elapsed time. We also recorded the searched phrase and its page number, and collected printouts or other identifying information about the matched sites in Google. If upon review of a matched website we determined a match was coincidental, the clock was restarted and we continued searching for the full 10 minutes or until a match was detected. Fifty theses from the sample of 260 were eliminated because they could not be retrieved, were duplicate records within WorldCat, or were doctoral dissertations instead of master's theses. We did not record the names or authors of the searched theses, and we replaced the names of the universities with numeric codes.
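
The sampling and phrase-search steps described above could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure the authors ran (they worked by hand in a browser): the record identifiers, the fixed seed, and the helper names `draw_sample` and `exact_phrase_url` are hypothetical, and the URL simply passes a quoted phrase as the `q` parameter of a Google search.

```python
import random
from urllib.parse import quote_plus


def draw_sample(record_ids, fraction=0.10, seed=2004):
    """Return a simple random sample containing `fraction` of the records.

    The seed is an arbitrary value chosen only to make the sketch repeatable.
    """
    rng = random.Random(seed)
    size = round(len(record_ids) * fraction)
    return rng.sample(record_ids, size)


def exact_phrase_url(phrase):
    """Build a search URL that asks for the phrase as an exact match.

    Wrapping the phrase in double quotes makes the engine treat it as one
    phrase rather than as separate words.
    """
    quoted = '"' + " ".join(phrase.split()) + '"'
    return "https://www.google.com/search?q=" + quote_plus(quoted)


# Hypothetical identifiers standing in for the 2,600 bibliographic records.
records = [f"record-{n:04d}" for n in range(1, 2601)]
sample = draw_sample(records)
print(len(sample))  # 260
print(exact_phrase_url("a suspect phrase copied from the thesis"))
```

Sampling without replacement keeps any record from being drawn twice, although duplicate records within the catalog itself (one of the reasons theses were later eliminated) would still need to be removed by hand.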

Results

Potential occurrences of plagiarism were found in 57 of 210 (27.14%) theses.

Matched phrases were detected in theses from 14 of the 22 universities (63.63%) represented in the sample. Institution F had the highest percentage of matches (60%) while institution S had the highest number of matches (12). Matches for other institutions are illustrated in Figure 1. Figure 2 shows the breakdown of matches by broad subject category. The highest match rates were found in Computer Science/Engineering (43.59%), Other Engineering (40%), and Mechanical Engineering and Aerospace (38.10%). The extent of the matched phrases found by Google is represented in Figure 3. Nearly 60% of the theses with potential occurrences of plagiarism contained more than one suspect long phrase (7 words or more). The average time to find a match was 3.8 minutes.
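
The percentages reported above can be re-derived from the counts given in the text and in Figure 3. The short Python check below is a reader's sketch rather than part of the study. The variable names are illustrative, and the last calculation assumes that "nearly 60%" refers to every Figure 3 category other than a single long phrase.

```python
# Counts copied from the article's text and figures.
theses_with_matches = 57
theses_searched = 210
institutions_with_matches = 14
institutions_total = 22

# Figure 3: number of theses by extent of the matched text.
extent = {
    "Long phrases (7+ words)": 23,
    "Multiple long phrases": 11,
    "Entire sentence": 6,
    "Multiple sentences": 7,
    "Entire paragraph": 6,
    "Multiple paragraphs": 3,
    "Entire publication": 1,
}


def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


overall = percent(theses_with_matches, theses_searched)
by_institution = percent(institutions_with_matches, institutions_total)
# Assumption: everything except the single-long-phrase category.
beyond_single_phrase = percent(
    sum(extent.values()) - extent["Long phrases (7+ words)"],
    theses_with_matches,
)

print(f"{overall:.2f}%")               # 27.14%
print(f"{by_institution:.2f}%")        # 63.64% rounded; the article reports 63.63%
print(f"{beyond_single_phrase:.1f}%")  # 59.6%
```

The institution-level figure differs from the article's in the last digit only because 14/22 is 63.636...%, which the article appears to truncate rather than round.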

Discussion

The 27.1% match rate suggests that Google can be used to effectively detect plagiarism in some master's theses. Using Google, we found word-for-word matches in scholarly articles, abstracts from scholarly articles, non-profit agency websites, corporate websites, scholarly conference proceedings, personal homepages, government publications, online glossaries and encyclopedias, and course pages from universities. Our sample had a high number of theses from science and technology disciplines. Further investigation is needed to determine Google's effectiveness across academic disciplines.

Google's effectiveness for detecting plagiarism appears to extend beyond sources which thesis authors actually consulted on the web. We discovered instances where our searched phrases matched sites that themselves appeared to contain plagiarized material. In some cases the matches would be to secondary sources that properly quoted and cited an original source, but in other cases none of the matches appeared to be original sources. In other words, Google's effectiveness for detecting plagiarism in master's theses is bolstered by the presence of other plagiarized works on the web.

Several matches were obtained against scholarly articles and books whose content was not available for viewing on the web. Increasingly, publishers are making proprietary, full-content material searchable on the web, but restricting full access to organization members or paying customers. In this project we matched several phrases against such sites. This trend enhances Google's effectiveness for detecting plagiarism, because it allows phrase searching against entire article and book content. While the searcher is usually limited to viewing only an abstract or small section of text on these sites, he or she would have the option of checking the full content in a library database, submitting an interlibrary loan request for the item, or purchasing the item.

27.1% of the theses were found to have suspicious phrases within 10 minutes of searching. In 16 of the 57 theses for which matched phrases were found, they were found on or before the first numbered page. However, the project revealed several factors impacting Google's efficiency for detecting plagiarism. First, Google searching does not always produce the source of plagiarism. While this is not necessary at the outset, in most cases a faculty member would want to track down the original source or article, particularly if the infraction was to be discussed with the student. Second, Google sometimes produces dozens of matches, and it is time-consuming to wade through long result lists. Third, the nature of web resources presents a problem. While theses typically contain submission dates and author information, it is not always possible to find such information on websites. Even when it can be found, it cannot always be trusted. Finally, it is time-consuming to determine whether the student authored or participated in the production of "matched" websites. The project yielded web matches to articles and abstracts that included names of thesis advisors and/or committee members, but not the student thesis author. It was not always possible to tell whether the student was plagiarizing the advisor's work, or whether the student was an uncredited author or co-author of the faculty publication. In some instances, phrases matched against web publications of research institutes or government agencies. Again, it was not always possible to determine connections or affiliations to the university at which the thesis was written or determine what role, if any, the thesis author played in these publications.

Many of the above factors would be mitigated for thesis advisors. They would often know the role a student played in previously published articles or conference presentations and they would draw on their subject expertise when selecting phrases for searching in Google as well as assessing any search results. Also, advisors would be checking the thesis prior to its submission, thus minimizing matches that result from others plagiarizing the thesis author.

Our results show that Google searching is an effective tool for detecting word-for-word plagiarism in some master's theses and that potential occurrences of plagiarism can be found relatively quickly using Google, but tracking down a plagiarized source can be more time-consuming. Further study is needed to determine the effectiveness of Google in various academic disciplines. Studies comparing Google against anti-plagiarism services and software are also needed. However, our results show that Google searching holds promise as a quick and inexpensive approach for detecting word-for-word plagiarism in theses.

References

Johnson, J. (1987). The Bedford guide to the research process. New York: St. Martin's Press.

Lathrop, A. and Foss, K. (2000). Student cheating and plagiarism in the internet era: A wake-up call. Englewood, CO: Libraries Unlimited.

Marshall, E. (1998). The internet: A powerful tool for plagiarism sleuths. Science, 279, 474.

Notess, G. Search Engine Showdown. Accessed May, 2004. http://searchengineshowdown.com.

Ryan, J. J.C.H. (2000). Student plagiarism in an online world. In A. Lathrop & K. Foss, Student cheating and plagiarism in the internet era: A wake-up call (pp. 56-59). Englewood, CO: Libraries Unlimited.

Turn-it-in. Accessed May, 2004. http://www.turnitin.com/static/legal/legal_document.html.

MARK McCULLOUGH

MELISSA HOLMBERG

Minnesota State University, Mankato

(1) Note: Johnson identifies "word-for-word transcription of the entire passage" as one of three types of plagiarism. For this study, we checked for "word-for-word" matches of phrases at least 7 words long. We also allowed for slight variations such as changes in tense and full forms of acronyms.
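
To make the 7-word criterion concrete, the Python sketch below lists word sequences of that length shared by two passages. It is an illustration only: the study relied on manual phrase searches in Google, not on automated comparison, and the function names, the normalization rule, and the sample sentences are all hypothetical. It does not handle the tense or acronym variations mentioned in the note.

```python
import re


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_phrases(thesis_text, source_text, min_words=7):
    """Return the set of min_words-long word sequences found in both texts."""
    def shingles(tokens):
        return {
            " ".join(tokens[i:i + min_words])
            for i in range(len(tokens) - min_words + 1)
        }
    return shingles(words(thesis_text)) & shingles(words(source_text))


# Invented sentences used only to exercise the function.
thesis = "Results show that the proposed method reduces error rates in noisy channels."
source = "We found that the proposed method reduces error rates in noisy channels overall."
for phrase in sorted(shared_phrases(thesis, source)):
    print(phrase)
```

A longer copied run shows up as several overlapping 7-word sequences, so the number of shared sequences gives a rough sense of how extensive a match is.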

Figure 1.

Institution   Found   Total        %

L                 0       1    0.00%
N                 0       1    0.00%
V                 0       1    0.00%
W                 0       1    0.00%
I                 0       3    0.00%
E                 0       5    0.00%
U                 0       5    0.00%
J                 0      10    0.00%
K                 1       3   33.33%
O                 1       9   11.11%
R                 1      10   10.00%
D                 2       4   50.00%
G                 2       8   25.00%
F                 3       5   60.00%
C                 3      10   30.00%
B                 4      14   28.57%
P                 5      12   41.67%
Q                 5      13   38.46%
A                 6      17   35.29%
T                 6      21   28.57%
M                 6      33   18.18%
S                12      24   50.00%

Total            57     210   27.14%

Found matches among 63.63% of institutions

Figure 2.

Broad Subject                                  Found   Total        %

Agricultural, Life, & Medical Sciences             6      31   19.35%
Architecture, Construction, Interior Design        2       8   25.00%
Arts & Humanities                                  1      23    4.35%
Business & Social Sciences                         5      27   18.52%
Computer Science/Engineering                      17      39   43.59%
Education                                          2      10   20.00%
Mechanical Engineering & Aerospace                 8      21   38.10%
Other Engineering                                 10      25   40.00%
Chemistry & Physical Sciences                      6      26   23.08%

Figure 3. Extent of Matched Phrases

Long phrases (7+ words)   23
Multiple long phrases     11
Entire sentence            6
Multiple sentences         7
Entire paragraph           6
Multiple paragraphs        3
Entire publication         1
Total                     57

COPYRIGHT 2005 Project Innovation (Alabama)
COPYRIGHT 2005 Gale Group