Email spam might be dropping, but that doesn’t mean spam is about to go away. Spammers have simply found new, and possibly more fruitful, vehicles for spreading junk, including search engines, Twitter, and, of course, SMS. In spite of the establishment of the NDNC (National Do Not Call) registry, SMS spam is rapidly increasing in India. I personally consider SMS spam to be much more of a nuisance than email spam, simply because sophisticated filters are available to tackle email spam, whereas with SMS spam we are mostly helpless. Manual blacklisting of repeat offenders is the best most of us can do.
For separating spam from ham, most email filters utilize two techniques:
Heuristic Approach: The software scores each message against a set of rules and patterns that characterize known spam.
Bayesian Approach: It’s a statistical approach that learns from the content of already processed messages and uses a probabilistic model to estimate how likely a new message is to be spam.
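To make the Bayesian idea concrete, here is a minimal sketch of a naive Bayes filter with add-one smoothing. It is only an illustration, not the filter used by any particular email product or by the IIIT-D team; the class name `NaiveBayesFilter`, the whitespace tokenizer, and the four training messages are all made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real filters tokenize more carefully.
    return text.lower().split()


class NaiveBayesFilter:
    """Hypothetical toy filter: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Learn from an already classified message.
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        # Assumes at least one spam and one ham message have been trained.
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the log prior of the class.
            score = math.log(self.message_counts[label] / total_messages)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the product.
                score += math.log((count + 1) / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Convert the two log scores into a normalized probability of spam.
        highest = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - highest)
        ham = math.exp(log_scores["ham"] - highest)
        return spam / (spam + ham)


# Made-up training data, purely for illustration.
spam_filter = NaiveBayesFilter()
spam_filter.train("win free recharge now", "spam")
spam_filter.train("free offer call now", "spam")
spam_filter.train("are you coming home now", "ham")
spam_filter.train("call me when you are free", "ham")

print(spam_filter.spam_probability("free recharge offer"))
print(spam_filter.spam_probability("are you coming"))
```

With this toy data the first message scores well above 0.5 and the second well below it. A real filter would need thousands of labeled messages.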
Unfortunately, these methods alone are not very effective when it comes to tackling SMS spam. The short length of messages, coupled with the use of abbreviations and vernacular languages, makes it very tough for machine learning algorithms to work with acceptable accuracy.
Now, a team of students at the Indraprastha Institute of Information Technology (Delhi) is trying to tackle this problem by employing the intelligence of the crowd. The team, led by Dr. Ponnurangam Kumaraguru, includes Vinayak Naik, Kuldeep Yadav, Atul Goyal, Ashish Gupta, Dipesh Kumar Singh, and Rushil Khurana.
To develop the initial proof of concept, the team ran an incentivized crowd-sourcing scheme on the IIIT-D campus (organized through Facebook) to collect sample spam messages. Pictured below is the tag cloud of the initial database of 4,318 messages, nearly half of which were spam.
Tag Cloud for SPAM (left) and HAM (right)
Some of the interesting observations made by the team from the initial training set are:
- Almost all messages that include a URL are spam.
- Certain special characters like /’ are frequently present in spam messages.
- Typically, the word count of spam messages is higher. Also, the average word length in legitimate messages is shorter due to the presence of abbreviations.
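Observations like these translate naturally into numeric features that a classifier can consume. The snippet below is a rough sketch of that step, not the team's actual feature extractor. The function name `extract_features`, the URL pattern, and the choice to count every non-alphanumeric, non-space character as "special" are my own assumptions.

```python
import re

# Assumption: a URL is anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def extract_features(message):
    """Hypothetical helper that turns one SMS into the features discussed above."""
    words = message.split()
    word_count = len(words)
    average_word_length = (
        sum(len(word) for word in words) / word_count if words else 0.0
    )
    # Assumption: count every character that is neither alphanumeric nor whitespace.
    special_characters = sum(
        1 for ch in message if not ch.isalnum() and not ch.isspace()
    )
    return {
        "has_url": bool(URL_PATTERN.search(message)),
        "special_characters": special_characters,
        "word_count": word_count,
        "average_word_length": average_word_length,
    }


# Made-up sample messages.
print(extract_features("Win a FREE trip! Visit www.example.com now"))
print(extract_features("c u at 5 pls"))
```

The abbreviated second message has no URL, no special characters, and a much shorter average word length, which is the pattern the team observed in legitimate messages.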
For spam filtering, the research group explored two techniques: Bayesian and SVM (support vector machine). SVM is a supervised machine learning technique commonly used for classification. The Bayesian approach, tested with SpamAssassin, yielded lower than desired accuracy in spam classification; SVM, however, was too computationally heavy for low and mid-range mobile devices, and it had a lower success rate in classifying ham. Dr. Ponnurangam’s team is currently working on an online module that will run a pre-trained SVM-based classifier on the server and pass the results to the app.
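To give a feel for why offloading makes sense, here is a rough sketch of what the server-side step could look like if the classifier were a linear SVM: once trained, classifying a message is just a weighted sum of its features plus a bias. Everything here is an assumption for illustration, including the linear kernel, the two features, the weight values, and the function name `classify_on_server`. The article does not describe the team's actual model or protocol.

```python
import re

URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

# Hypothetical pre-trained parameters; a real model would learn these from data.
WEIGHTS = {"has_url": 2.0, "word_count": 0.15}
BIAS = -2.5


def classify_on_server(message):
    """Sketch of a linear SVM decision function: sign(w . x + b)."""
    features = {
        "has_url": 1.0 if URL_PATTERN.search(message) else 0.0,
        "word_count": float(len(message.split())),
    }
    score = sum(WEIGHTS[name] * value for name, value in features.items()) + BIAS
    # The server would send back only this small verdict to the phone.
    return {"label": "spam" if score >= 0 else "ham", "score": score}


# Made-up sample messages.
print(classify_on_server("Win a FREE trip! Visit www.example.com now"))
print(classify_on_server("c u at 5 pls"))
```

In this arrangement the expensive part, training, happens once on the server, and the phone only needs to send the message and read back a label.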
A Symbian app, which will offer full-featured spam protection on mobile phones with or without a data connection, is currently in the final stages of development. The choice of Symbian as the launch platform might surprise some; however, the decision was likely inspired by the ground situation in India. Nokia still has a significant presence in India, and it dominates the mid-range segment. In terms of volume, I suspect Android is still quite far behind Symbian. That being said, an Android app is planned, and will possibly be released later in the summer. In the meantime, you can check out the research paper to get a better understanding of the underlying technology.