Crowdsourcing with reCAPTCHA: How Google is Remaking History- Part 2
By Chinmoy Kanjilal on May 9th, 2011

I was quite fascinated by reCAPTCHA during my college days, and this is a two-part series on my analysis of it. You can check out the first part of the reCAPTCHA series here, where we talk about crowdsourcing and explain how reCAPTCHA works.

With reCAPTCHA, Google redirects the vast amount of effort otherwise wasted on typing throwaway CAPTCHA text into transcribing old editions of The New York Times and old books. All this content goes into the Google Books and Google News archives. It is an impressive project that digitizes content which might otherwise have been lost in the coming centuries.

ReCAPTCHA Usage Stats

recaptcha-logo

This graph shows the gradual rise in reCAPTCHA usage from 2008 to 2011. More than 30 million reCAPTCHAs are solved every day across 96,000 sites, which makes for some interesting findings. Only 560 of the top million websites use reCAPTCHA right now.

Popular names like Craigslist, Facebook and 4chan all use reCAPTCHA to stop spam and display a total of 100 million CAPTCHAs each day.

After a year of usage, reCAPTCHA had solved 1.2 billion CAPTCHAs, with 440 million suspicious words deciphered successfully. These stats mean little unless you can picture them in terms of books. So, 440 million words is as good as The God Delusion 25,000 times over, but that is not impressive enough for me! Maybe I am doing it wrong. Those are stats for the first year only!

ReCAPTCHA was in place on 40,000+ websites when those stats were released, and it is now on 96,000+ websites. That is 240% of the initial size. Assuming a linear increase (considering average usage), reCAPTCHA might decipher about 1,050 million suspicious words this year. Counting from 2008, this makes a whopping 29,920,000,000 suspicious words, which is as good as all of Shakespeare’s plays (39 of them, with 928,913 words) 32,200 times over, all in the last three years.
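The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch: every input is a figure quoted in this post, the cumulative total is taken as stated rather than derived, and the function names are made up for illustration.

```python
# Figures quoted in the post (taken as given, not independently verified).
FIRST_YEAR_WORDS = 440_000_000      # suspicious words deciphered in the first year
SITES_THEN = 40_000                 # sites using reCAPTCHA when those stats came out
SITES_NOW = 96_000                  # sites using reCAPTCHA in 2011
CUMULATIVE_WORDS = 29_920_000_000   # cumulative total as stated in the post
SHAKESPEARE_WORDS = 928_913         # words in Shakespeare's 39 plays, as stated


def growth_percent(then: int, now: int) -> int:
    """Current size as a percentage of the initial size."""
    return 100 * now // then


def projected_words(first_year: int, then: int, now: int) -> int:
    """Scale first-year output linearly with the number of sites."""
    return first_year * now // then


def shakespeare_multiples(words: int) -> float:
    """How many times the complete plays fit into `words`."""
    return words / SHAKESPEARE_WORDS


if __name__ == "__main__":
    print(growth_percent(SITES_THEN, SITES_NOW), "% of the initial size")
    print(projected_words(FIRST_YEAR_WORDS, SITES_THEN, SITES_NOW), "words this year")
    print(round(shakespeare_multiples(CUMULATIVE_WORDS)), "times Shakespeare's plays")
```

The linear scaling assumes each site contributes roughly the same number of words, which is a simplification.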

Did you think shorter words are easier to solve? In fact, longer words give the user more to go on when guessing. Words with four characters are solved with an accuracy of 93.7%, whereas words with seven characters achieve 96.7%. Moreover, solving a difficult CAPTCHA image takes about the same time as solving an easy reCAPTCHA image, because it is a (near) known word.

How secure is it?

ReCAPTCHA can be seen below as it appeared over the last few years.

recaptcha-2007
The year 2007

recaptcha-2010
The year 2010

recaptcha-2011
The year 2011

Security is the real question here, as it is the sole reason to use reCAPTCHA. How secure is this whole process? I am not talking about beating the OCR here, because if you can do that, there is a very good reason to stop spamming and build a billion-dollar company with that killer OCR technology. The real deal is what happens in each cycle of a reCAPTCHA submission. Is the asymmetric key approach used by reCAPTCHA secure enough?

A security researcher, Jonathan Wilkins, has taken a firm stand against reCAPTCHA, showing on multiple occasions that it is vulnerable. With each newer version, Google claims that its CAPTCHAs are secure, and Wilkins cracks them with greater accuracy to reveal exploits. The last version of reCAPTCHA (the one using a strike-through line) was successfully cracked by Wilkins with 20% accuracy. However, he has not started working on the new reCAPTCHA system with double images. Another hacker, Chad Houck, has also outlined methods to hack reCAPTCHA.
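To see why a 20% crack rate matters, here is a rough sketch of what it means for a bot that can simply retry. It assumes every attempt is independent and succeeds with the same probability, and that nothing (rate limits, blacklists) stops the retries. Those are my assumptions for illustration, not claims about Wilkins' method, and the function names are invented.

```python
def expected_attempts(success_rate: float) -> float:
    """Average number of tries until the first success (geometric distribution)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def chance_of_success(success_rate: float, attempts: int) -> float:
    """Probability of at least one success within `attempts` independent tries."""
    if not 0 <= success_rate <= 1:
        raise ValueError("success_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1 - (1 - success_rate) ** attempts


if __name__ == "__main__":
    rate = 0.20  # the accuracy quoted above for the strike-through version
    print(f"average tries per solved CAPTCHA: {expected_attempts(rate):.1f}")
    print(f"chance of getting through in 10 tries: {chance_of_success(rate, 10):.1%}")
```

Under these assumptions, a low per-attempt accuracy still lets an automated attacker through after a handful of retries, which is why the raw percentage understates the problem.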

Some stats in this post were used with permission from trends.builtwith.com. You can see their FAQ page for more info.

You can also look at this relatively old document [link to PDF file] that talks about reCAPTCHA.

Author: Chinmoy Kanjilal
Chinmoy Kanjilal is a FOSS enthusiast and evangelist. He is passionate about Android. Security exploits turn him on and he loves to tinker with computer networks. He rants occasionally at Techarraz.com. You can connect with him on Twitter @ckandroid.

Chinmoy Kanjilal can be contacted at chinmoy@techie-buzz.com.
  • http://www.asfaq.com Asfaq Tapia

    ReCaptcha or even Captcha is so easy to crack it's not even funny. It works on the principle that out of the two words, one is known and the other isn't. Google monitors the unknown word and, over time, detects what most people entered for it to ‘learn’ the word.

    Therefore, you only have to get the most visible word right in the captcha message in order to bypass the system.

    • http://www.techarraz.com Chinmoy Kanjilal

      It still takes away the possibility of an automated process and needs humans to do this thing. I think reCAPTCHA is not a security haven, but it works much better than those shitty crackable CAPTCHAs. Microsoft’s ASIRRA is an alien.

  • Beluga

    10% accuracy is enough for spambots to be useful. The Russian XRumer forum spamming software can solve reCaptchas, look it up. Stopforumspam and similar blacklists are the only reasonable solution for spam blocking.

 
Copyright 2006-2012 Techie Buzz. All Rights Reserved. Our content may not be reproduced on other websites.