Stanford Researchers Breach Captcha Security Codes

A group from Stanford University has demonstrated that Captcha is no guarantee of safety, thwarting the best guard we have against automated attacks. A Captcha is supposed to be solvable only by humans, not by bots or any other automated programs. It achieves this by showing a word or phrase rendered in a style that software cannot easily read. Users have to enter this code in order to gain access. It was developed at Carnegie Mellon University by a graduate student in 2000. Captcha is actually a fancy acronym for a bland phrase: Completely Automated Public Turing Test to tell Computers and Humans Apart.

Breached!!

The Decaptcha

Stanford Security Laboratory post-doctoral researchers Elie Bursztein, Matthieu Martin and John C. Mitchell busted that myth when they created a tool, named Decaptcha, that breaks 13 of the 15 most widely used Captcha schemes. The sites used for testing were high-profile ones like CNN, Visa, eBay and Wikipedia. Bursztein says:

For example, our automated Decaptcha tool breaks the Wikipedia scheme… approximately 25% of the time. 13 out of 15 of the most widely used current schemes are similarly vulnerable to automated attack by our tool. Therefore, there is a clear need for a comprehensive set of design and testing principles that will lead to more robust captchas

The working principle of Decaptcha is simple: it reduces background noise, breaks the string into single characters and recognizes each character's pattern. It achieved varying degrees of success at various sites. It broke Visa’s Authorize.net 66% of the time and eBay 43% of the time. Wikipedia's scheme was breached 25% of the time.
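The three stages described above (noise reduction, segmentation, recognition) can be illustrated on a toy image. This is only a minimal sketch, not the Stanford team's code: the 3x5 glyph templates, the made-up image, the function names and the noise rule (drop any pixel with no neighbours) are all assumptions chosen to keep the example small.

```python
# Hypothetical 3x5 glyph templates; a real attack would learn these from data.
TEMPLATES = {
    "7": ["111", "001", "010", "010", "010"],
    "1": ["010", "110", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
}

# A made-up captcha image: "#" is ink, "." is background.
# The lone pixels at (row 0, col 4) and (row 3, col 10) are noise.
IMAGE = [
    "###.#..#....#..",
    "..#...##....#..",
    ".#.....#....#..",
    ".#.....#..#.#..",
    ".#....###...###",
]


def to_grid(rows):
    """Convert text rows into a grid of 0/1 pixels."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def denoise(grid):
    """Stage 1: clear every set pixel that has no set neighbour (8-connected)."""
    height, width = len(grid), len(grid[0])
    cleaned = [row[:] for row in grid]
    for y in range(height):
        for x in range(width):
            if not grid[y][x]:
                continue
            block = sum(
                grid[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            if block == 1:  # only the pixel itself is set
                cleaned[y][x] = 0
    return cleaned


def segment(grid):
    """Stage 2: cut the image into glyphs wherever a column is empty."""
    width = len(grid[0])
    glyphs, start = [], None
    for x in range(width + 1):
        empty = x == width or not any(row[x] for row in grid)
        if not empty and start is None:
            start = x
        elif empty and start is not None:
            glyphs.append(["".join(map(str, row[start:x])) for row in grid])
            start = None
    return glyphs


def recognize(glyph):
    """Stage 3: return the template character with the fewest differing pixels."""
    def distance(template):
        if len(template[0]) != len(glyph[0]):
            return float("inf")
        return sum(a != b for t, g in zip(template, glyph) for a, b in zip(t, g))

    best = min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch]))
    return best if distance(TEMPLATES[best]) != float("inf") else "?"


def solve(rows):
    return "".join(recognize(g) for g in segment(denoise(to_grid(rows))))


print(solve(IMAGE))  # prints 71L
```

Skipping the denoise step on this image yields five segments instead of three, which hints at why background noise and characters that touch each other make segmentation harder.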

The team shared a report elucidating the strengths and weaknesses of the Captcha method. The link is given below.

Report link:  http://cdn.ly.tl/publications/text-based-captcha-strengths-and-weaknesses.pdf

Google Untouched!

There is, however, some good news for those seeking online security. Google was unbeatable and so was reCAPTCHA. reCAPTCHA is an improved version of Captcha, which makes it more difficult for bots to recognize patterns by warping and twisting words into strange forms readable only by humans. Google now owns reCAPTCHA, which it acquired in 2009. In these two cases, Decaptcha scored no breaches.

Not yet breached!

The bottom line is that Captcha needs to be upgraded. Next time you feel smug about getting into a site by correctly typing in the captcha code, think twice. There are some smart computer programs sharing the same cyberspace!


Crowdsourcing with reCAPTCHA: How Google is Remaking History- Part 2

I was quite fascinated by reCAPTCHA during my college days, and this is a two-part series on my analysis of it. You can check out the first part of the reCAPTCHA series here, where we talk about crowdsourcing and explain how reCAPTCHA works.

With reCAPTCHA, Google redirects the vast amount of energy wasted on typing useless CAPTCHA text into transcribing old copies of The New York Times and old books. All this content goes into the Google Books and Google News archives. It is an impressive project that digitizes content which might otherwise have been lost in the coming centuries.

ReCAPTCHA Usage Stats

recaptcha-logo

This graph shows the gradual rise in reCAPTCHA usage from 2008 to 2011. More than 30 million reCAPTCHAs are solved every day across 96,000 sites, which makes for some interesting findings. Only 560 of the top million websites use reCAPTCHA right now.

Popular names like Craigslist, Facebook and 4chan all use reCAPTCHA to stop spam, and together they display a total of 100 million CAPTCHAs each day.

After a year of usage, reCAPTCHA had served 1.2 billion solved CAPTCHAs, with 440 million suspicious words deciphered successfully. These stats mean little unless you can imagine them in terms of books. So, 440 million words is as good as The God Delusion 25,000 times over, but that is not impressive enough for me! Maybe I am doing it wrong. Those are stats for the first year only!

ReCAPTCHA was in place on 40,000+ websites when those stats were released, and it is now on 96,000+ websites. That is 240% of the initial size. Assuming a linear increase (with the same average usage per site), reCAPTCHA might be deciphering about 1,056 million suspicious words this year. Counting from 2008, this makes a whopping 29,920,000,000 suspicious words, which is as good as all of Shakespeare’s plays (39 of them, with 928,913 words) about 32,200 times, all in the last three years.
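The arithmetic behind this projection can be checked in a few lines. This is only a sketch: the variable names are made up, the inputs are the figures quoted in the post, and the cumulative total is taken from the post as-is rather than derived here.

```python
# Figures quoted in the post; names are illustrative only.
first_year_words = 440_000_000       # suspicious words deciphered in year one
sites_then, sites_now = 40_000, 96_000

# Growth in site count, as a percentage of the initial size.
growth_percent = sites_now * 100 // sites_then
print(growth_percent)                # prints 240

# Linear projection: same average usage per site, more sites.
projected_words = first_year_words * sites_now // sites_then
print(projected_words)               # prints 1056000000

# How many copies of Shakespeare's plays fit into the post's cumulative total?
cumulative_words = 29_920_000_000    # taken from the post, not derived here
shakespeare_words = 928_913
print(cumulative_words // shakespeare_words)  # prints 32209
```

Integer arithmetic is used throughout so that the results are exact. The strict projection is 1,056 million words, and the Shakespeare comparison comes out at roughly 32,200 copies.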

Did you think shorter words are easier to solve? Well, longer words give the user more context for a better guess. Thus, four-letter words are solved with an accuracy of 93.7%, whereas seven-letter words achieve 96.7% accuracy. Moreover, it takes the same time to solve a difficult CAPTCHA image as an easy reCAPTCHA image, because it is a (near) known word.

How secure is it?

ReCAPTCHA can be seen below as it appeared over the last few years.

recaptcha-2007
The year 2007

recaptcha-2010
The year 2010

recaptcha-2011
The year 2011

Security is the real question here, as it is the sole reason to use reCAPTCHA. How secure is this whole process? I am not talking about beating it with OCR here, because if you can do that, there is a very good reason to stop spamming and build a billion-dollar company with that killer OCR technology. The real question is what happens in each cycle of reCAPTCHA submission. Is the asymmetric key approach used by reCAPTCHA secure enough?

A security researcher, Jonathan Wilkins, has taken a firm stand against reCAPTCHA, proving on multiple occasions that it is vulnerable. With each new version, Google claims that its CAPTCHAs are secure, and Wilkins cracks it with greater accuracy to reveal exploits. The last version of reCAPTCHA (the one using a strike-through line) was successfully cracked by Wilkins with 20% accuracy. However, he has not started working on the new reCAPTCHA system with double images. Another hacker, Chad Houck, has also outlined methods to hack reCAPTCHA.

Some stats in this post were used with permission from trends.builtwith.com. You can see their FAQ page for more info.

You can also look at this relatively old document [link to PDF file] that talks about reCAPTCHA.

Crowdsourcing with reCAPTCHA: How Google is Remaking History- Part 1

ReCAPTCHA is the best example of crowdsourcing. With reCAPTCHA (CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart), Google has taken the initiative to digitize a huge collection of books and NY Times articles, and this is a commendable project. Google acquired reCAPTCHA in September 2009 and it has become an essential part of the web today. Google claims that it has enough content to digitize that the project can go on for years. I personally feel this awesome project should go on for decades.

Crowdsourcing can be done for a variety of motives, with various methodologies. However, reCAPTCHA takes an enhanced approach to crowdsourcing by turning it into a useful security feature. reCAPTCHA is used as a reverse Turing test to determine whether a user is a human or a bot, with the determination being made by a computer in this case. The advantage for those who put reCAPTCHA in front of the forms on their websites is strict regulation of content and protection from automated scripts. These scripts are used to spread spam after all!

How well does it work?

recaptcha-logo
OCR Fail

Google claims, and I believe too, that it works better than OCR. You can see these samples here. The human eye can perform the best OCR. Transcriptions done with reCAPTCHA are more than 99% accurate.

Two different OCR programs analyze the image; their respective outputs are then aligned with each other by standard string matching algorithms (7) and compared to each other and to an English dictionary. Any word that is deciphered differently by both OCR programs or that is not in the English dictionary is marked as suspicious. These are typically the words that the OCR programs failed to decipher correctly. According to our analysis, about 96% of these suspicious words are recognized incorrectly by at least one of the OCR programs; conversely, 99.74% of the words not marked as suspicious are deciphered correctly by both programs. Each suspicious word is then placed in an image along with another word for which the answer is already known, the two words are distorted further to ensure that automated programs cannot decipher them, and the resulting image is used as a CAPTCHA.
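The flagging rule in the passage above can be sketched roughly as follows. This is an illustration under assumptions, not reCAPTCHA's actual code: Python's difflib stands in for the "standard string matching algorithms", and the function names, sample words and tiny dictionary are made up. The second function goes a step beyond the quoted passage: it assumes the commonly described scheme in which a correct answer for the known word lets the user through and counts their reading of the unknown word as a vote.

```python
from collections import Counter
from difflib import SequenceMatcher


def mark_suspicious(ocr_a, ocr_b, dictionary):
    """Flag words the two OCR outputs disagree on, or that are not in the dictionary.

    ocr_a and ocr_b are lists of words from two (hypothetical) OCR engines.
    """
    suspicious = []
    matcher = SequenceMatcher(a=ocr_a, b=ocr_b, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            # Both engines agree: still suspicious if it is not a known word.
            suspicious.extend(
                word for word in ocr_a[a0:a1] if word.lower() not in dictionary
            )
        else:
            # The engines disagree: flag engine A's reading (or B's if A has none).
            suspicious.extend(ocr_a[a0:a1] or ocr_b[b0:b1])
    return suspicious


def record_answer(votes, control_word, typed_control, typed_unknown):
    """Assumed grading step: accept the user only if the known word is correct,
    and in that case count their reading of the unknown word as one vote."""
    if typed_control.strip().lower() != control_word.lower():
        return False
    votes[typed_unknown.strip().lower()] += 1
    return True


dictionary = {"the", "quick", "brown", "fox", "jumps"}
engine_a = ["the", "qu1ck", "brown", "fox", "jumqs"]
engine_b = ["the", "quick", "brown", "f0x", "jumqs"]
print(mark_suspicious(engine_a, engine_b, dictionary))
# prints ['qu1ck', 'fox', 'jumqs']

votes = Counter()
print(record_answer(votes, "brown", "Brown", "quick"))  # prints True
print(record_answer(votes, "brown", "br0wn", "quick"))  # prints False
```

Note that "fox" is flagged even though engine A read it correctly: a disagreement between the two engines is enough to mark the position as suspicious, which matches the rule in the quote.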

Google has a pretty old PDF saying there were 40,000+ websites using reCAPTCHA, but recent stats (from 3rd May 2011) say there are 96,000+ reCAPTCHAs in place worldwide. Are you impressed? We have covered the stats in more detail in the second part of this series.

What is in it for Google?

Google gets more content to share, sell, advertise with and hold in possession. The project feeds content to Google News and Google Books, which brings in more revenue from advertisements.

Do not forget to check out the second post in this reCAPTCHA series, where we have crunched some numbers and come up with some interesting stats.