Spam, egg, Spam, Spam, bacon and Spam

Nobody likes spam. Except the people that send it, I suppose. But I don’t like spam, and you probably don’t like spam. Any website that allows users to submit data is going to have to deal with spam at some point. Unbounce and our customers are no exception.

(Transcript: http://www.detritus.org/spam/skit.html)

<a href=http://www.kayakingplus.com/cheap-nhl-jerseys-c-7.html><b>buy cheap nhl hockey jerseys from china</b></a>
<a href="http://www.kayakingplus.com/cheap-louis-vuitton-outlet-c-17.html" >cheap louis vuitton outlet</a>
[url=http://www.kayakingplus.com/cheap-louis-vuitton-outlet-c-17.html]cheap louis vuitton outlet[/url]

In the past, we dealt with spam by completely blocking requests from IPs that had a history of submitting spam, after customers brought them to our attention. Some spam comes from people’s home or business computers that have been infected with botnet software, due to outdated virus checkers, unapplied security fixes, or 0-day exploits.

Due to the way IP addresses are dynamically allocated on most parts of the internet, it’s quite likely that the infected computer will get assigned a different IP eventually. However, once we block an IP, we don’t see its traffic any more, so we can’t tell when (or if) that IP stops sending spam. As a result, IPs that end up on our block list stay there permanently.

Obviously, that wasn’t an ideal situation. We needed something better. That something is the Reputator.

Introducing the Reputator

Reputator is a new service we’ve deployed that aggregates reputation information about IPs from sources around the internet, and decides if their reputation is suspect enough for us to be confident that the traffic is from a bot. If it is, Reputator informs our page servers of the IP, and they begin flagging traffic from it. Reputator will continue to monitor that IP’s reputation, and if it eventually cleans up its act, will stop flagging it.

Technical Details

Reputator is something that will be refined over time, as we’re able to study the impact it has, but this is what we’re starting with.

Reputator aggregates data from four sources:

1. StopForumSpam.com

From their site:

We provide lists of spammers that persist in abusing forums and blogs with their scams, ripoffs, exploits and other annoyances

We regularly download a number of the files StopForumSpam.com provides, containing lists of spammer IPs, and use these to provide a reputation score. The more recently that an IP has been seen sending spam, the higher the score we assign.

2. ProjectHoneyPot.com

Similar to StopForumSpam.com, the ProjectHoneyPot.com service provides a way to get a threat score for an IP. The higher their threat score, the higher the score that this component of Reputator assigns an IP.

3. BearTrap

BearTrap is the name of our internally developed honeypot, using techniques from http://nedbatchelder.com/text/stopbots.html. IPs that get caught in the BearTrap are assigned a score based on how often that happens.

4. ELK

ELK is one of the analytics tools we use. Every view of a page or submission of a form is recorded in ELK. Reputator uses it to see how many requests an IP has made to us over various time periods. Each of these time periods has a threshold, and the closer the IP is to the requests-per-period threshold, the higher the score assigned to it.
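To make the request-rate idea concrete, here is a minimal sketch of how a score could be derived from per-period request counts. The periods, the thresholds, the cap at 1.0, and the choice to take the worst period are all assumptions for illustration; the real values and the way Reputator queries ELK aren’t shown here.

```python
from typing import Dict

# Hypothetical thresholds: period length in seconds -> requests allowed in that period.
THRESHOLDS: Dict[int, int] = {60: 30, 3600: 600, 86400: 5000}


def rate_score(counts: Dict[int, int]) -> float:
    """Score an IP from its request counts, keyed by period length in seconds.

    Each period contributes count / threshold, capped at 1.0 (an assumption).
    The result is the highest ratio across periods (also an assumption).
    """
    ratios = [
        min(counts.get(period, 0) / limit, 1.0)
        for period, limit in THRESHOLDS.items()
    ]
    return max(ratios)


# An IP halfway to the per-minute threshold and well under the others.
print(rate_score({60: 15, 3600: 60, 86400: 100}))
```

Taking the maximum means a short burst and a slow, steady crawl can both raise the score, which seems to fit the idea of watching several time periods at once.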

Reputator takes the scores from those sources, assigns a weight to each, and produces an aggregate score. If that score is higher than 1.0, then that IP goes on the naughty list. It won’t get back on the nice list until its score drops below 1.0 for a sufficient amount of time.
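The naughty-list rule can be sketched as a small state machine. This is only an illustration: the `IpStatus` type, the `update` function, and the 24-hour cool-down are hypothetical, since the post only says the score must stay low "for a sufficient amount of time".

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 1.0                    # from the post: a score above 1.0 flags the IP
COOL_DOWN_SECONDS = 24 * 60 * 60   # hypothetical stand-in for "a sufficient amount of time"


@dataclass
class IpStatus:
    flagged: bool = False
    below_since: Optional[float] = None  # when the score first stopped exceeding the threshold


def update(status: IpStatus, score: float, now: float) -> IpStatus:
    """Apply a fresh aggregate score to an IP's status at time `now` (seconds)."""
    if score > THRESHOLD:
        # Over the threshold: flag the IP and restart any cool-down in progress.
        status.flagged = True
        status.below_since = None
    elif status.flagged:
        # Treating a score of exactly 1.0 as "not above" is an assumption.
        if status.below_since is None:
            status.below_since = now
        elif now - status.below_since >= COOL_DOWN_SECONDS:
            status.flagged = False
            status.below_since = None
    return status
```

Requiring the score to stay low for a while before unflagging keeps an IP that hovers around 1.0 from flapping between the two lists.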

When Reputator flags an IP, it outputs a format that gives us a lot of information about why it was flagged, so that we can adjust settings in the event of incorrectly classifying an IP as suspicious.

Let’s take an example:

(SFS=0.16=>0.53;PHP[CommentSpammer,Suspicious]=0.31=>0.62)=1.15

We can walk through the format to see what logic Reputator applied.

The StopForumSpam (SFS) check returned a score of 0.16. Applying a weight of 3.34 to that gives a score of 0.53. That’s below 1.0, so we move on to the next check.

ProjectHoneyPot (PHP) came back with a threat score of 80 (out of 255). We divide that by 255 to get a score in the range of 0.0-1.0, which gives us 0.31. Applying a weight of 2.0 gives a score of 0.62. Combining those two scores gives 1.15, which is over 1.0, so we don’t need to check the other sources.
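The walkthrough above can be reproduced with a short sketch. The `evaluate` function, the tuple layout, and the `BearTrap` label and weight are hypothetical. Rounding each intermediate value to two decimals is also an assumption: it is what makes the figures match the example, and without it the total comes out closer to 1.16.

```python
from typing import Callable, List, Tuple

THRESHOLD = 1.0  # from the post: a total above 1.0 flags the IP

# (label, weight, function returning a raw score in the range 0.0-1.0)
Check = Tuple[str, float, Callable[[], float]]


def evaluate(checks: List[Check]) -> Tuple[float, str]:
    """Run checks in order, stopping as soon as the running total exceeds 1.0."""
    parts = []
    total = 0.0
    for label, weight, fetch_raw in checks:
        raw = round(fetch_raw(), 2)           # assumed rounding, see lead-in
        weighted = round(raw * weight, 2)
        total = round(total + weighted, 2)
        parts.append(f"{label}={raw:.2f}=>{weighted:.2f}")
        if total > THRESHOLD:
            break  # already over the threshold, skip the remaining sources
    return total, f"({';'.join(parts)})={total:.2f}"


def never_called() -> float:
    raise AssertionError("remaining sources should be skipped")


checks = [
    ("SFS", 3.34, lambda: 0.16),
    ("PHP[CommentSpammer,Suspicious]", 2.0, lambda: 80 / 255),
    ("BearTrap", 1.0, never_called),  # hypothetical label and weight
]
total, summary = evaluate(checks)
print(summary)  # (SFS=0.16=>0.53;PHP[CommentSpammer,Suspicious]=0.31=>0.62)=1.15
```

Stopping early means the slower or rate-limited sources are only consulted when the cheaper ones haven’t already settled the question.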

Rollout

We rolled Reputator out into production gradually, so we could see the impact. Initially, we only turned on the BearTrap part of it, and didn’t use the other reputation sources. The results of this looked good, but having it in production did highlight a few small bugs, which we were able to fix before enabling the remaining reputation sources.

Using ELK, we have excellent visibility into how many requests are being flagged, what IPs they come from, and much much more. Here’s what it looked like when we turned on the remainder of Reputator’s reputation sources:

Yellow are the IPs we’ve been manually blocking over the years. Until we’re completely happy with how Reputator is working, we will continue to flag requests from these IPs.
Purple are requests that were caught in the BearTrap. Light purple are HTTP GET requests, and dark purple are POSTs.
Blue are requests that Reputator caught without BearTrap, using its other reputation sources. Light blue are HTTP GET requests, and dark blue are POSTs.

The graph clearly shows the fourfold increase in the spam we caught once we turned on all of Reputator’s data sources.

Beating spammers is a cat and mouse game. We improve our ability to catch spam, without accidentally catching legitimate traffic, and, in turn, they improve their ability to sneak past our filters. With Reputator, we’ve upped our game in yet another way. Reputator doesn’t block requests from suspicious IPs, as our old method did. The requests are still allowed through, but they’re flagged internally so they don’t count as page views, leads, or conversions. From the spambot’s point of view though, there’s no difference. They can’t see that they’ve been flagged, so they won’t know that they need to adapt their techniques, or move on to a different IP. This gives us a bit of a leg-up in the cat and mouse game.
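As a rough sketch of flagging instead of blocking, the snippet below answers every request the same way and only changes what gets counted. The `PageServer` class, its fields, and the example IPs are hypothetical and are not how our page servers are actually structured.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PageServer:
    flagged_ips: Set[str] = field(default_factory=set)  # assumed to be fed by Reputator
    page_views: int = 0

    def handle_view(self, ip: str) -> int:
        """Serve the page to everyone, but only count views from unflagged IPs."""
        if ip not in self.flagged_ips:
            self.page_views += 1
        return 200  # the response is identical either way


server = PageServer(flagged_ips={"203.0.113.7"})
print(server.handle_view("203.0.113.7"), server.page_views)   # flagged: served, not counted
print(server.handle_view("198.51.100.1"), server.page_views)  # normal: served and counted
```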

Keeping spam out of our customers’ page views, leads, and conversions will be an ongoing battle with the spammers. With Reputator, we’ve made our first move, and positioned ourselves to be able to refine and adapt as we see how it works, and how spammers’ techniques change over time.

–Derek Lewis,
Senior Software Developer
