I was quite fascinated by reCAPTCHA during my college days, and this post is part of a two-part series on my analysis of it. You can check out the first part in the reCAPTCHA series here, where we talk about crowdsourcing and explain how reCAPTCHA works.
With reCAPTCHA, Google directs the vast amount of energy wasted on filling in useless CAPTCHA text into transcribing old copies of the New York Times and old books. All this content goes into the Google Books and Google News archives. It is an impressive project that digitizes content that might otherwise have been lost in the coming centuries.
ReCAPTCHA Usage Stats
This graph shows the gradual rise in reCAPTCHA usage from 2008 to 2011. More than 30 million reCAPTCHAs are solved every day across 96,000 sites, which makes for some interesting findings. Only 560 of the top million websites use reCAPTCHA right now.
Popular names like Craigslist, Facebook and 4chan all use reCAPTCHA to stop spam, and together they display a total of 100 million CAPTCHAs each day.
After a year of usage, reCAPTCHA had solved 1.2 billion CAPTCHAs, and 440 million suspicious words had been deciphered successfully. These stats mean little unless you can imagine them in terms of books. So, 440 million is as good as The God Delusion 25,000 times, but that is not impressive enough for me! Maybe I am doing it wrong. Those are stats for the first year only!
ReCAPTCHA was in place on 40,000+ websites when those stats were released, and now it is already on 96,000+ websites. That is 240% of the initial size. Assuming average usage per site stays the same, reCAPTCHA might be deciphering about 1,050 million suspicious words this year. Going by a linear increase from 2008, this makes a whopping 2,992,000,000 suspicious words, which is as good as all of Shakespeare’s plays (39 of them, with 928,913 words) about 3,200 times, all in the last three years.
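The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not an official figure: it assumes that usage scales with the number of sites and that the yearly totals grow linearly across four yearly data points (2008 to 2011). The function and variable names are made up for illustration.

```python
# Figures quoted in the post.
FIRST_YEAR_WORDS = 440_000_000   # suspicious words deciphered in the first year
SITES_THEN = 40_000              # sites when the first-year stats were released
SITES_NOW = 96_000               # sites using reCAPTCHA now
SHAKESPEARE_WORDS = 928_913      # word count the post uses for all 39 plays


def yearly_estimates(first, last, points):
    """Linearly interpolate `points` yearly totals from `first` to `last`."""
    step = (last - first) / (points - 1)
    return [first + step * i for i in range(points)]


# Assumption: words deciphered per year scale with the number of sites.
growth = SITES_NOW / SITES_THEN                  # 2.4, i.e. 240% of the initial size
this_year = FIRST_YEAR_WORDS * growth            # about 1.056 billion words

# Assumption: four yearly data points (2008, 2009, 2010, 2011), linear growth.
per_year = yearly_estimates(FIRST_YEAR_WORDS, this_year, points=4)
total = sum(per_year)
shakespeares = total / SHAKESPEARE_WORDS

print(f"growth factor: {growth:.2f}")
print(f"this year: {this_year / 1e6:.0f} million words")
print(f"total since 2008: {total / 1e9:.3f} billion words")
print(f"complete Shakespeare equivalents: {shakespeares:,.0f}")
```

Under these assumptions the cumulative total comes out at roughly 3 billion words, or a little over 3,200 complete sets of Shakespeare's plays. A different growth model (say, exponential) would give a different total, so treat the number as an order-of-magnitude guess.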
Did you think shorter words are easier to solve? Well, longer words give the user more context for a better guess. Thus, words with four letters are solved with an accuracy of 93.7%, whereas words with seven letters achieve an accuracy of 96.7%. Moreover, a difficult reCAPTCHA image takes about as long to solve as an easy one, because it is a (near) known word.
How secure is it?
ReCAPTCHA can be seen below as it appeared over the last few years.
The year 2007
The year 2010
The year 2011
Security is the real question here, as it is the sole reason to use reCAPTCHA. How secure is this whole process? I am not talking about beating the OCR here, because if you can do that, there is a very good reason to stop spamming and build a billion-dollar company with that killer OCR technology. The real deal is what happens in each cycle of reCAPTCHA submission. Is the asymmetric key approach used by reCAPTCHA secure enough?
A security researcher, Jonathan Wilkins, has taken a firm stand against reCAPTCHA, proving on multiple occasions that it is vulnerable. With each new version, Google claims that its CAPTCHAs are secure, and Wilkins cracks them with greater accuracy to reveal exploits. The last version of reCAPTCHA (the one using a strike-through line) was successfully cracked by Wilkins with 20% accuracy. However, he has not started working on the new reCAPTCHA system with double images. Another hacker, Chad Houck, has also outlined methods to hack reCAPTCHA.
You can also look at this relatively old document [link to PDF file] that talks about reCAPTCHA.