Why are they spamming my site?
The point of comment spam is that spammers want to get their web pages ranked highly on search engines: they want their sites to be on the first page of search results for certain keywords when someone googles for these keywords.
How do I help them get on the 1st page of Google?
Getting ranked on Google is a complicated topic, but the concept that is relevant here is PageRank. To build PageRank, a page needs incoming, relevant links. Some SEO experts might consider this too simplistic an explanation, but in a nutshell it means you need relevant keywords to appear on the page that links to you, preferably ‘close’ to the link. Bonus points if you can make them the link’s anchor text.
Another important factor is that the ranking of the page that links to you counts too. Suppose you’re trying to sell bike sheds. The best incoming links would be from a top-rated site that covers and reviews bike sheds. A link from an obscure forum that covers a completely unrelated topic won’t count nearly as much as a link from an expert on the topic - almost like in the real world.
How spammers & their bots operate
Spammers often set up networks of sites with carefully crafted pages that all ultimately link to each other and back to the pages they are trying to promote. They then try to rank these sites by spamming other sites’ pages with comments: this is where your site comes into play.
They use an automated script, a spambot, to post content to anything that looks like a comment form. The comment will contain a link as well as some keywords that they are trying to promote. Usually the keywords are related to what their sites are trying to sell, but sometimes they also try to hijack unrelated searches. The link is necessary to boost their own page rank (without it comment spam would be absolutely pointless). The keywords are necessary to associate the link with something of their choosing, but also to poison (stuff) your page so that it looks a bit more relevant to their topic than it really is. If this topic is of interest to you, we suggest you read up on spamdexing.
How can one block spam?
There are different ways of blocking spam on a form on a page: each with certain pros and cons.
Block all links from being posted
There would be no comment spam if we didn’t allow links to be posted, as it would defeat the purpose of getting back-links, but fairly often people do want to leave legitimate links in comments. It is great to be able to credit your own site as part of your comment; in fact this is an integral part of how the Internet works. There goes that idea.
Moderate comments
Some websites go as far as to moderate all comments, which means no comment will appear on a site until an administrator has checked it. This can break the flow of conversation and confuse people who add valid comments: their comment won’t appear immediately and they might even end up posting it twice.
Differentiate between spambots and real people
Spambots are scripts that try to act like a real person using a browser. However, many spambots don’t behave the way a real user would: they’ll fill in all the fields in a form (even ones that are hidden from a user); they don’t interpret JavaScript; they might not browse between URLs in the same way a user does; they might not receive cookies or send them along in the same way; a bot might submit a lot of comments in a short period of time using the same email address or IP address. There are many other things that a human will never do but a bot might, and of course vice versa.
It is possible to check for these triggers, and it’s fairly effective to simply ignore comments matching situations that real users would never trigger, but nowadays this only stops the most basic bots.
A very effective differentiation method is the use of CAPTCHA images, but they are a pain to have to complete each time and pose accessibility issues too: a CAPTCHA might be so effective that it blocks out real people with mild visual disabilities. On the other hand, even CAPTCHAs sometimes fail.
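As a rough illustration of checking for such triggers, here is a minimal sketch in Python. The field name `website_url`, the three-second threshold and the function name are all assumptions made for this example; they are not what our filter actually uses.

```python
import time

# Hypothetical names and thresholds, chosen only for this example.
HONEYPOT_FIELD = "website_url"  # hidden with CSS, so a human leaves it empty
MIN_SECONDS_TO_FILL = 3         # assumed: people rarely submit faster than this


def looks_like_bot(form, rendered_at, now=None):
    """Return True if the submission trips a trigger a real user would not."""
    now = time.time() if now is None else now
    # A bot that fills in every field also fills in the hidden one.
    if form.get(HONEYPOT_FIELD, "").strip():
        return True
    # A form submitted almost instantly was probably not typed by a person.
    if now - rendered_at < MIN_SECONDS_TO_FILL:
        return True
    return False
```

Here `form` is a plain dictionary of submitted fields and `rendered_at` is the timestamp at which the form was served. Comments for which this returns True can simply be ignored.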
Bayesian Filtering
Bayesian spam filtering is a technique first popularised by Paul Graham that’s often used to filter email spam, but it can also be used to filter spam comments. Email spam has different objectives to comment spam - it is aimed directly at people, whereas comment spam is aimed at search engines. This makes email spam quite a bit different to filter than comment spam. Not really easier or more difficult, just different.
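To give a feel for the idea, here is a toy naive Bayes classifier over word counts. It is a simplified sketch: it is neither Paul Graham’s exact formula nor the API of any particular library, and the class and method names are made up for this example.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes classifier over word counts (illustrative only)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, label, text):
        """Record one example; label is 'spam' or 'ham'."""
        self.words[label].update(self.tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_docs = sum(self.docs.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: how common this class is, with add-one smoothing.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            total_words = sum(self.words[label].values())
            for word in self.tokens(text):
                # Word likelihood, smoothed so unseen words never give zero.
                count = self.words[label][word]
                score += math.log((count + 1) / (total_words + vocab + 1))
            log_score[label] = score
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700), -700)
        return 1 / (1 + math.exp(diff))
```

After training on known spam and known good comments, `spam_probability()` returns a value between 0 and 1, where values above 0.5 lean towards spam. A real filter needs far more training data than a toy like this suggests.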
Heuristic methods
A very resilient way to filter spam is to assign each comment a score. Let’s say a comment’s score starts at 0 and has to be at least 1 for the comment to be let through. If it is below 1 you mark it as spam.
You then look for certain patterns (sometimes called “smells”) in a comment, adding points to the score for things that make a comment appear more legit and subtracting points for things that look spammy. The beauty of this method is that something like one spammy word won’t automatically make a comment look like spam if it looks legitimate enough otherwise. It is also easy to add points to comments where you know the user has already commented successfully before, or the user has passed a CAPTCHA test before, or the comment contains no links, etc. You can also subtract extra points if the IP address, email address, URL or comment body has been used in spam before. You can also do flood detection: if the commenter added an unrealistic number of comments in a short period of time, you can penalise the comment further.
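A minimal sketch of such a scorer might look like the following. The word list, the point values and the flood threshold are invented for illustration; only the starting score of 0 and the pass mark of 1 come from the description above.

```python
import re
from dataclasses import dataclass

SPAMMY_WORDS = {"casino", "viagra", "loan"}  # illustrative list only
LINK_RE = re.compile(r"https?://", re.IGNORECASE)
PASS_SCORE = 1  # a comment needs at least 1 to be let through


@dataclass
class Comment:
    body: str
    known_good_author: bool = False      # has commented successfully before
    passed_captcha_before: bool = False
    seen_in_spam_before: bool = False    # IP, email, URL or body matched old spam
    recent_comment_count: int = 0        # comments by this author in the last minute


def score_comment(c):
    """Start at 0, add points for good signs, subtract for spammy ones."""
    score = 0
    links = len(LINK_RE.findall(c.body))
    words = set(re.findall(r"[a-z]+", c.body.lower()))
    if links == 0:
        score += 1                       # no links at all is a good sign
    else:
        score -= links                   # each link costs a point
    score -= len(words & SPAMMY_WORDS)   # each spammy word costs a point
    if c.known_good_author:
        score += 2
    if c.passed_captcha_before:
        score += 1
    if c.seen_in_spam_before:
        score -= 3
    if c.recent_comment_count > 5:       # assumed flood threshold
        score -= 2
    return score


def is_spam(c):
    return score_comment(c) < PASS_SCORE
```

With these made-up weights, a known good author who happens to mention a loan still gets through, while a comment stuffed with links and spammy words does not.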
Self-moderation
One thing that makes it easier to filter comment spam than email spam is that you have more ways to deal with false positives. If you mark a legitimate email as spam, then there’s nothing the person that sent the mail can do to rectify the situation. But if a legitimate comment didn’t make it through, you can easily notify the commenter and the good news is that most bots wouldn’t pick up on it.
Most false positives that didn’t make it through will therefore have a score of 0 rather than 1. In this case you can present another form to the user that she will have to fill in to boost the score and remove the spam flag. In other words, the user validates (or moderates) her own comment by proving she is human. We then remember that she validated herself and then next time around boost the score automatically.
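The self-moderation flow could be sketched roughly like this. The function names, the in-memory set and the size of the boost are assumptions for the example; a real site would store validated commenters in a database.

```python
PASS_SCORE = 1
VALIDATED_BOOST = 1  # assumed size of the boost for validated commenters

validated_authors = set()  # stand-in for persistent storage


def effective_score(author, base_score):
    """Add the boost for authors who have proved they are human before."""
    return base_score + (VALIDATED_BOOST if author in validated_authors else 0)


def handle_comment(author, base_score, expected_answer, challenge_answer=None):
    """Return 'published', 'challenge' or 'rejected' for one submission."""
    if effective_score(author, base_score) >= PASS_SCORE:
        return "published"
    if challenge_answer is None:
        # Flagged as spam: offer the commenter a form to validate themselves.
        return "challenge"
    if challenge_answer.strip().lower() == expected_answer.strip().lower():
        validated_authors.add(author)  # remember her for next time
        return "published"
    return "rejected"
```

A flagged commenter first gets `'challenge'` back. Once she answers correctly, her comment is published, and her next comment with the same base score passes without a challenge.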
The methods we use on your sites
We used to do Bayesian filtering (based on Divmod’s Reverend) in combination with a couple of other methods and it worked fairly effectively - up to now. The spam database has grown to the point where it started to hurt performance: notably the app’s startup time. We had to either create a separate spam filtering server or abandon this method in favour of heuristic methods.
We took another look at all the comments (and spams) we received over the past few years, researched other people’s ideas, experimented with a new spam filter and realised we can do much better.
We now do some magic to block the most obvious bots, then we assign scores to comments and if a comment appears to be spam we ask the user a simple question to ‘prove that they are human’ as another chance to make the comment pass along.
(It is possible to combine Bayesian filtering with our current score-based methods by adding or subtracting points if a comment looks like spam or not. We’ll look at adding that but for now it doesn’t seem necessary.)
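One way such a combination could work, purely as a sketch, is to map the Bayesian filter’s spam probability onto a few points that are added to or subtracted from the score. The function name and the default weight of 2 are assumptions, not part of our filter.

```python
def bayes_adjustment(spam_probability, weight=2):
    """Map a spam probability in [0, 1] to points in [-weight, +weight]."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    # 0.0 (clearly legitimate) gives +weight, 0.5 gives 0, 1.0 gives -weight.
    return round(weight * (1 - 2 * spam_probability))
```

The result would simply be added to the heuristic score, so a comment the Bayesian filter is unsure about is left to the other signals.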
Is it working?
For now, yes. However, spammers are constantly tweaking and evolving their bots and inevitably some of them will get through. If they do, please use the ‘report spam’ functionality and we’ll tweak our spam filter again: we have some more tricks up our sleeves.
What about the mechanical turks?
You can also get ‘manual’ comment and contact spam that’s more difficult to detect and filter. Some spam actually gets written and submitted by real people (so-called mechanical turks) and these comments are usually better worded to bypass filters. A real person can easily validate themselves by passing CAPTCHA-style tests. There’s nothing you can do but report these messages manually as soon as they appear. Luckily there are relatively few of them compared to bot spam (due to the trouble and time involved in posting spam like this), and the content of these messages is usually not too offensive. If you delete them quickly enough the spammers will hopefully stop bothering.
Please feel welcome to comment below should you want to know more.