Crowdsourcing with reCAPTCHA: How Google is Remaking History - Part 1

reCAPTCHA is arguably the best example of crowdsourcing. With reCAPTCHA, a service built on the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), Google has taken the initiative to digitize a huge collection of books and New York Times articles, and it is a commendable project. Google acquired reCAPTCHA in September 2009, and it has become an essential part of the web today. Google claims that it has enough content left to digitize that the project can go on for years. I personally feel this awesome project should go on for decades.

Crowdsourcing can be done for a variety of motives and with various methodologies. However, reCAPTCHA takes an enhanced approach to crowdsourcing by turning it into a useful security feature. reCAPTCHA is used as a reverse Turing test to determine whether a user is a human or a bot; it is "reverse" because the judging is done by a computer rather than by a human. The advantage for those who put reCAPTCHA in front of the forms on their websites is tighter control over submitted content and protection from automated scripts. These scripts are used to spread spam, after all!

How well does it work?

OCR Fail

Google claims, and I believe too, that it works better than OCR. You can see these samples here. The human eye is still the best OCR. Transcriptions done with reCAPTCHA are more than 99% accurate.

The reCAPTCHA team describes the process as follows: "Two different OCR programs analyze the image; their respective outputs are then aligned with each other by standard string matching algorithms (7) and compared to each other and to an English dictionary. Any word that is deciphered differently by both OCR programs or that is not in the English dictionary is marked as suspicious. These are typically the words that the OCR programs failed to decipher correctly. According to our analysis, about 96% of these suspicious words are recognized incorrectly by at least one of the OCR programs; conversely, 99.74% of the words not marked as suspicious are deciphered correctly by both programs. Each suspicious word is then placed in an image along with another word for which the answer is already known, the two words are distorted further to ensure that automated programs cannot decipher them, and the resulting image is used as a CAPTCHA."
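To make the flagging step concrete, here is a rough Python sketch of how two OCR outputs could be aligned and checked against a dictionary. This is only an illustration, not reCAPTCHA's actual code: the function name find_suspicious_words, the sample sentences and the tiny dictionary are all made up for this example, and Python's difflib stands in for the "standard string matching algorithms" mentioned above.

```python
from difflib import SequenceMatcher


def find_suspicious_words(ocr_a, ocr_b, dictionary):
    """Flag words that two OCR outputs disagree on or that are not in the dictionary.

    ocr_a and ocr_b are lists of words from two (hypothetical) OCR programs.
    Returns a list of (reading_a, reading_b) tuples, in document order.
    """
    suspicious = []
    # Align the two word sequences; autojunk is disabled so that
    # frequent words are never silently ignored during alignment.
    matcher = SequenceMatcher(None, ocr_a, ocr_b, autojunk=False)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            # Both programs agree: flag the word only if it is not a dictionary word.
            for word in ocr_a[a_start:a_end]:
                if word.lower() not in dictionary:
                    suspicious.append((word, word))
        else:
            # The programs disagree here: flag both readings.
            reading_a = " ".join(ocr_a[a_start:a_end])
            reading_b = " ".join(ocr_b[b_start:b_end])
            suspicious.append((reading_a, reading_b))
    return suspicious


# Made-up sample data: program B misreads "brown", and both misread "over".
dictionary = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
ocr_a = "the quick brown fox jumps ovcr the lazy dog".split()
ocr_b = "the quick brovvn fox jumps ovcr the lazy dog".split()

for reading_a, reading_b in find_suspicious_words(ocr_a, ocr_b, dictionary):
    print(reading_a, "|", reading_b)
```

In this toy run, "brown" is flagged because the two programs disagree on it, and "ovcr" is flagged because both programs agree on a reading that is not a dictionary word. In the real system, the flagged words are the ones that would be sent out to humans inside a CAPTCHA.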

Google has a pretty old PDF saying there were 40,000+ websites using reCAPTCHA, but more recent stats (from 3rd May 2011) say there are 96,000+ reCAPTCHAs in place worldwide. Are you impressed? We have covered the stats in more detail in the second part of this series.

What is in it for Google?

Google gets more content to share, sell, advertise against and keep. The project feeds content to Google News and Google Books, which brings in more revenue from advertisements.

Do not forget to check out the second post in this reCAPTCHA series, where we have crunched some numbers and come up with some interesting stats.

Published by

Chinmoy Kanjilal

Chinmoy Kanjilal is a FOSS enthusiast and evangelist. He is passionate about Android. Security exploits turn him on and he loves to tinker with computer networks. You can connect with him on Twitter @ckandroid.