On Wed, 2012-02-15 at 21:56 -0500, John Taylor-Johnston wrote:

> I'm a teacher. I want to use PHP to interface with Google and see if a
> student has plagiarized.
>
> I don't see many open-source projects on the subject, so I want to
> create my own script.
>
> How can I use PHP to interface with Google and see if this text exists
> on the internet?
>
> If this is possible, I need some ideas on how to parse the text and
> input it into Google.
>
> Then I might like to get a percentage idea of how this text compares to
> a site that Google has indexed.
>
>
> $SampleText = "Lorem ipsum dolor sit amet, test link adipiscing elit.
> Nullam dignissim convallis est. Quisque aliquam. Donec faucibus. Nunc
> iaculis suscipit dui. Nam sit amet sem. Aliquam libero nisi, imperdiet
> at, tincidunt nec, gravida vehicula, nisl. Praesent mattis, massa quis
> luctus fermentum, turpis mi volutpat justo, eu volutpat enim diam eget
> metus. Maecenas ornare tortor. Donec sed tellus eget sapien fringilla
> nonummy. Mauris a ante. Suspendisse quam sem, consequat at, commodo
> vitae, feugiat in, nunc. Morbi imperdiet augue quis tellus."
>
> John
>

Wow, that's a pretty big project you're chewing there. A quick search
shows that there are some projects out there to detect plagiarism, but
I think anything of university calibre costs a hefty sum of money.

To get a rough idea, you could break a text into sentences, and then
query each one as an exact phrase to see whether it occurs verbatim.
You can use cURL to grab the search results pages for this sort of
thing; there's no need for a special interface. There are a few things
to bear in mind though:

* Google's terms and conditions may prohibit using their search engine
  like this, or may impose a limit on how much you can do this.
* Some sentences will be intentionally copied, as quotes. You may need
  some sort of check against the source to see whether the match sits
  in a quote context.
* What if only part of a sentence is copied?
  Maybe after you've searched for exact matches for the sentences in
  the source, you could remove the matched ones from the source, then
  re-check each remaining sentence against Google's fuzzy search. It
  may produce many false positives though.

There are plenty of other factors too. Students may copy from books
that don't exist in a search engine's archives. Some subjects may also
lead to the same wording unintentionally, particularly technical
subjects, which tend to leave less room for creative and flowery
description.

-- 
Thanks,
Ash
http://www.ashleysheridan.co.uk
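
The sentence-by-sentence approach described above could be sketched in
PHP roughly as follows. This is only a minimal sketch under some
assumptions: the function names (splitSentences, buildExactQueryUrl,
fetchPage, percentFlagged) are made up for illustration, the search URL
is just a plain results-page address used as a placeholder, and the
sentence splitter is deliberately naive. How you decide that a results
page counts as a "hit" is left open, since that depends on the page
markup and on what the search engine's terms allow.

```php
<?php
// Naive sentence splitter: breaks on whitespace that follows ., ! or ?.
// Assumption: good enough for a rough check; abbreviations will confuse it.
function splitSentences(string $text): array
{
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_map('trim', $parts);
}

// Build a results-page URL for an exact-phrase query (sentence in double quotes).
// Assumption: $baseUrl is a placeholder; check the terms of use before querying it.
function buildExactQueryUrl(string $sentence, string $baseUrl = 'https://www.google.com/search'): string
{
    return $baseUrl . '?q=' . urlencode('"' . $sentence . '"');
}

// Fetch a page with cURL. Returns the body as a string, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Percentage of sentences flagged as found, given one boolean per sentence.
function percentFlagged(array $flags): float
{
    if (count($flags) === 0) {
        return 0.0;
    }
    return 100.0 * count(array_filter($flags)) / count($flags);
}

// Usage: print the query URL for each sentence (no network access here).
$sample = 'Quisque aliquam. Donec faucibus. Nunc iaculis suscipit dui.';
foreach (splitSentences($sample) as $sentence) {
    echo buildExactQueryUrl($sentence), PHP_EOL;
}
```

To turn this into a rough percentage, you would call fetchPage() on each
URL (with a pause between requests), record true or false per sentence
according to whatever hit test you settle on, and pass that array to
percentFlagged(). Sentences that sit inside quotation marks in the
student's text could be skipped before querying.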