Get Inside Unbounce

Spam, egg, Spam, Spam, bacon and Spam

Nobody likes spam. Except the people that send it, I suppose. But I don’t like spam, and you probably don’t like spam. Any website that allows users to submit data is going to have to deal with spam at some point. Unbounce and our customers are no exception.

(Transcript: http://www.detritus.org/spam/skit.html)

<a href=http://www.kayakingplus.com/cheap-nhl-jerseys-c-7.html><b>buy cheap nhl hockey jerseys from china</b></a>
<a href="http://www.kayakingplus.com/cheap-louis-vuitton-outlet-c-17.html" >cheap louis vuitton outlet</a>
[url=http://www.kayakingplus.com/cheap-louis-vuitton-outlet-c-17.html]cheap louis vuitton outlet[/url]

In the past, we dealt with spam by completely blocking requests from IPs that had a history of submitting spam, after customers brought them to our attention.  Some spam comes from people’s home or business computers that have been infected with botnet software, due to outdated virus checkers, unapplied security fixes, or 0-day exploits.

Due to the way IP addresses are dynamically allocated on most parts of the internet, it’s quite likely that the infected computer will get assigned a different IP eventually.  However, once we block an IP, we don’t see its traffic any more, so we can’t tell when (or if) that IP stops sending spam.  As a result, IPs that end up on our block list stay there permanently.

Obviously, that wasn’t an ideal situation.  We needed something better.  That something is the Reputator.

Introducing the Reputator

Reputator is a new service we’ve deployed that aggregates reputation information about IPs from sources around the internet, and decides if their reputation is suspect enough for us to be confident that the traffic is from a bot.  If it is, Reputator informs our page servers of the IP, and they begin flagging traffic from it.  Reputator will continue to monitor that IP’s reputation, and if it eventually cleans up its act, will stop flagging it.

Technical Details

Reputator is something that will be refined over time, as we’re able to study the impact it has, but this is what we’re starting with.

Reputator aggregates data from four sources:

  1. StopForumSpam.com
  2. ProjectHoneyPot.com
  3. BearTrap
  4. ELK

1. StopForumSpam.com

From their site:

We provide lists of spammers that persist in abusing forums and blogs with their scams, ripoffs, exploits and other annoyances

We regularly download a number of the files StopForumSpam.com provides, containing lists of spammer IPs, and use these to provide a reputation score.  The more recently that an IP has been seen sending spam, the higher the score we assign.
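As a rough illustration of recency-based scoring, the sketch below maps the time since an IP was last seen sending spam to a score between 0.0 and 1.0. The linear decay, the 30-day window, and the function name are assumptions made for this example, not Reputator's actual formula.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical window: sightings older than this contribute nothing.
MAX_AGE = timedelta(days=30)


def recency_score(last_seen: datetime, now: datetime) -> float:
    """Return 1.0 for an IP seen just now, falling linearly to 0.0 at MAX_AGE."""
    age = now - last_seen
    if age <= timedelta(0):
        return 1.0
    if age >= MAX_AGE:
        return 0.0
    return 1.0 - age / MAX_AGE


now = datetime(2015, 1, 31, tzinfo=timezone.utc)  # arbitrary example date
print(recency_score(now - timedelta(days=3), now))   # recent sighting, high score
print(recency_score(now - timedelta(days=45), now))  # stale sighting, no score
```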

2. ProjectHoneyPot.com

Similar to StopForumSpam.com, the ProjectHoneyPot.com service provides a way to get a threat score for an IP.  The higher their threat score, the higher the score that this component of Reputator assigns an IP.

3. BearTrap

BearTrap is the name of our internally developed honeypot, using techniques from http://nedbatchelder.com/text/stopbots.html.  IPs that get caught in the BearTrap are assigned a score based on how often that happens.
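One of the techniques in that article is a form field that humans never see but bots tend to fill in. The sketch below shows that idea together with a frequency-based score. The field name, the cap of five catches, and both function names are hypothetical; BearTrap's real implementation isn't described here.

```python
from collections import Counter

# Hypothetical name of a form field hidden from humans with CSS; bots that
# fill in every field they find will give it a value.
HONEYPOT_FIELD = "website_url"

# Hypothetical cap: this many catches yields the maximum score of 1.0.
CATCHES_FOR_MAX_SCORE = 5

catches = Counter()  # IP address -> number of times it was caught


def record_submission(ip: str, form: dict) -> bool:
    """Return True (and count a catch) if the hidden field was filled in."""
    caught = bool(form.get(HONEYPOT_FIELD, "").strip())
    if caught:
        catches[ip] += 1
    return caught


def beartrap_score(ip: str) -> float:
    """Score grows with the number of catches, capped at 1.0."""
    return min(catches[ip] / CATCHES_FOR_MAX_SCORE, 1.0)


record_submission("203.0.113.7", {"email": "a@example.com", "website_url": "http://spam.example"})
record_submission("198.51.100.2", {"email": "b@example.com", "website_url": ""})
print(beartrap_score("203.0.113.7"), beartrap_score("198.51.100.2"))
```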

4. ELK

ELK is one of the analytics tools we use.  Every view of a page or submission of a form is recorded in ELK.  Reputator uses it to see how many requests an IP has made to us over various time periods.  Each of these time periods has a threshold, and the closer the IP is to the requests-per-period threshold, the higher the score assigned to it.
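A minimal sketch of threshold-based rate scoring might look like the following. The periods, the limits, and the choice to take the highest ratio across periods are all assumptions for illustration.

```python
# Hypothetical thresholds: period length in seconds -> request count treated as the limit.
THRESHOLDS = {60: 30, 3600: 600, 86400: 5000}


def rate_score(counts: dict) -> float:
    """counts maps a period length in seconds to the requests seen in that period.

    Each period contributes requests / limit, capped at 1.0; the score is the
    highest ratio across all periods.
    """
    ratios = [
        min(counts.get(period, 0) / limit, 1.0)
        for period, limit in THRESHOLDS.items()
    ]
    return max(ratios)


# 15 requests in the last minute is halfway to the one-minute limit.
print(rate_score({60: 15, 3600: 60, 86400: 100}))
```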

Reputator takes the scores from all of those sources, assigns various weights to each, and produces an aggregate score.  If that score is higher than 1.0, then that IP goes on the naughty list.  It won’t get back on the nice list until its score drops below 1.0 for a sufficient amount of time.
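The "stays flagged until the score has been low for a while" behaviour can be sketched as a small state machine. The 24-hour cool-down and the class name are hypothetical, and treating a score of exactly 1.0 as "below" is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

THRESHOLD = 1.0
# Hypothetical: how long a score must stay at or below the threshold before unflagging.
COOL_DOWN = timedelta(hours=24)


class IpState:
    def __init__(self) -> None:
        self.flagged = False
        self.below_since: Optional[datetime] = None

    def update(self, score: float, now: datetime) -> bool:
        """Record a fresh aggregate score and return whether the IP is flagged."""
        if score > THRESHOLD:
            # Over the threshold: flag and restart any cool-down in progress.
            self.flagged = True
            self.below_since = None
        elif self.flagged:
            # Not over the threshold: start or continue the cool-down.
            if self.below_since is None:
                self.below_since = now
            if now - self.below_since >= COOL_DOWN:
                self.flagged = False
                self.below_since = None
        return self.flagged


state = IpState()
start = datetime(2015, 1, 1)  # arbitrary example date
print(state.update(1.15, start))                        # flagged
print(state.update(0.40, start + timedelta(hours=1)))   # still flagged, cooling down
print(state.update(0.40, start + timedelta(hours=26)))  # unflagged after 25 quiet hours
```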

When Reputator flags an IP, it outputs a format that gives us a lot of information about why it was flagged, so that we can adjust settings in the event of incorrectly classifying an IP as suspicious.

Let’s take an example:

(SFS=0.16=>0.53;PHP[CommentSpammer,Suspicious]=0.31=>0.62)=1.15

We can walk through the format to see what logic Reputator applied.

The StopForumSpam (SFS) check returned a score of 0.16, which, after applying a weight of 3.34, resulted in a score of 0.53.  That’s below 1.0, so we move on to the next check.

ProjectHoneyPot (PHP) came back with a threat score of 80 (out of 255).  We divide that by 255 to get a score in the range of 0.0-1.0, which gives us 0.31.  Applying a weight of 2.0 gives a score of 0.62.  Combining those two scores gives 1.15, which is over 1.0, so we don’t need to check the other sources.
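The walkthrough can be reproduced with a short sketch. The function name and the tuple layout are made up for this example, and rounding to two decimals at every step is an assumption chosen so that the numbers match the ones above; the third check and its values are hypothetical and only there to show that evaluation stops early.

```python
THRESHOLD = 1.0


def evaluate(checks):
    """checks: ordered list of (label, raw_score, weight) tuples.

    Returns (flagged, explanation). Stops at the first check that pushes the
    running total over the threshold.
    """
    total = 0.0
    parts = []
    for label, raw, weight in checks:
        raw = round(raw, 2)
        weighted = round(raw * weight, 2)
        total = round(total + weighted, 2)
        parts.append(f"{label}={raw:.2f}=>{weighted:.2f}")
        if total > THRESHOLD:
            break  # already over 1.0, no need to consult the remaining sources
    return total > THRESHOLD, f"({';'.join(parts)})={total:.2f}"


checks = [
    ("SFS", 0.16, 3.34),
    ("PHP[CommentSpammer,Suspicious]", 80 / 255, 2.0),  # threat score 80 out of 255
    ("BearTrap", 0.9, 1.0),  # hypothetical values; never reached in this example
]
flagged, explanation = evaluate(checks)
print(flagged, explanation)
# True (SFS=0.16=>0.53;PHP[CommentSpammer,Suspicious]=0.31=>0.62)=1.15
```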

Rollout

We rolled Reputator out into production gradually, so we could see the impact.  Initially, we only turned on the BearTrap part of it, and didn’t use the other reputation sources.  The results of this looked good, but having it in production did highlight a few small bugs, which we were able to fix before enabling the remaining reputation sources.

Using ELK, we have excellent visibility into how many requests are being flagged, what IPs they come from, and much much more.  Here’s what it looked like when we turned on the remainder of Reputator’s reputation sources:

[Graph: reputation-flagged]

Yellow are the IPs we’ve been manually blocking over the years.  Until we’re completely happy with how Reputator is working, we will continue to flag requests from these IPs.
Purple are requests that were caught in the BearTrap.  Light purple are HTTP GET requests, and dark purple are POSTs.
Blue are requests that Reputator caught without BearTrap, using its other reputation sources.  Light blue are HTTP GET requests, and dark blue are POSTs.

The graph clearly shows the fourfold increase in the spam we caught once we turned on all of Reputator’s data sources.

Beating spammers is a cat and mouse game.  We improve our ability to catch spam, without accidentally catching legitimate traffic, and, in turn, they improve their ability to sneak past our filters.  With Reputator, we’ve upped our game in yet another way.  Reputator doesn’t block requests from suspicious IPs, like our old method did.  The requests are still allowed through, but they’re flagged internally so they don’t count as page views, leads, or conversions.  From the spambot’s point of view though, there’s no difference.  They can’t see that they’ve been flagged, so they won’t know that they need to adapt their techniques, or move on to a different IP.  This gives us a bit of a leg-up in the cat and mouse game.
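To make the flag-instead-of-block idea concrete, here is a minimal sketch in which every request gets the same response but only unflagged traffic is counted. The class names, the counters, and the response string are hypothetical and are not taken from our page servers.

```python
from dataclasses import dataclass, field


@dataclass
class PageStats:
    views: int = 0          # counted toward the customer's statistics
    flagged_views: int = 0  # recorded internally, excluded from statistics


@dataclass
class PageServer:
    flagged_ips: set = field(default_factory=set)
    stats: PageStats = field(default_factory=PageStats)

    def handle_view(self, ip: str) -> str:
        """Serve the page to everyone; only count views from unflagged IPs."""
        if ip in self.flagged_ips:
            self.stats.flagged_views += 1
        else:
            self.stats.views += 1
        return "200 OK"  # identical response either way


server = PageServer(flagged_ips={"203.0.113.7"})
print(server.handle_view("203.0.113.7"), server.handle_view("198.51.100.2"))
print(server.stats)
```

Because the response is the same in both branches, a bot has nothing to observe that would tell it which branch it hit.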

Keeping spam out of our customers’ page views, leads, and conversions will be an ongoing battle with the spammers.  With Reputator, we’ve made our first move, and positioned ourselves to be able to refine and adapt as we see how it works, and how spammers’ techniques change over time.

Derek Lewis,
Senior Software Developer

15 Comments


  • Lou Sturm

    3 years ago

    Such a great post Derek!

    I’m really excited to see a fourfold increase in the spam Unbounce is catching. Big, big shoutout to you, and the Developers at Unbounce for making this happen.

  • Corey Dilley

    3 years ago

    This sounds like an amazing standalone product that other online software could use. Did you build it, because nobody’s yet built something this good for stopping spam? And could the Reputator be easily used by other software companies?

    Also, great intro :)

    • David Dossot

      3 years ago

      Great questions Corey.

      We’ve built Reputator because we’re facing a spam profile that’s slightly different from the typical blog form spam: for example, we have to deal with crawling bots generating undue visitor traffic, without posting any form data. We have also noticed that the threat is moving very fast, faster than what people report to popular spam databases. Hence our homegrown system and its capacity to analyze traffic in a near realtime fashion.

      Regarding your second question, we have secret plans for offering our spam index over a public API, so others could benefit from our spam hunting. But, hush, that’s just between the two of us :)

  • Carter Gilchrist

    3 years ago

    Awesome writeup Derek — so excited to see the results over the last week here. Your whole team did a great job getting this out there and monitoring the various stages of release.

  • Rick Perreault

    3 years ago

    Great stuff. Our customers will be super happy

  • Oli Gardner

    3 years ago

    So innovative. Really impressed with the angles you guys took to solve this.

    It’s a weird parallel to think back to having to deal with *real* SPAM(tm) as opposed to digital spam. Fortunately, I grew up in a poor corned beef family rather than a poor spam family.

  • That’s a great data-driven article that makes perfect sense to me. Love the graphic – that puts it all right into perspective. Obviously, you guys have to guard your technology in order to keep making sales (nobody wants dirty data!) – but I have to say I’m very impressed with your robust answer. Did you call it Elk because it tells spam to “vamoose”? :)

    • David Dossot

      3 years ago

      ELK stands for “ElasticSearch Logstash Kibana”, the three applications that make up our indexing and search system.

      But to be frank, your explanation of the acronym is much better :)

  • Stuart Mitchell

    3 years ago

    Well done folks, it’s great to see a good solid explanation of what’s been done and the lengths you’ve gone to to take the issue seriously and get it solved. :) Thank you!

  • Kenji Sano

    3 years ago

    I have a landing page hosted in another site and it’s getting a lot of spam from “forums.Darodar.com” I am planning to move that Landing page to Unbounce. How can I know “forums.darodar.com” is in the blacklist of unbounce spammers?

    Thank you

    • David Dossot

      3 years ago

      Kenji, what do you mean by “getting a lot of spam from forums.Darodar.com”? Do you mean people are following links to your site that they found in these forums?

      • Kenji Sano

        3 years ago

        Actually “forums.darodar.com” is a non-existent page. I’m not sure how it works, but since Google Analytics said it was my main referral, I googled this URL and found many people having the same issue, and they recommend blocking it via .htaccess.

        http://www.sudorank.com/guide-how-to-block-darodar-referral-spam-to-your-website/

        • David Dossot

          3 years ago

          Thanks for the extra information. We do not currently have the possibility to filter visitor traffic based on the page they come from (i.e. the “Referer” header mentioned in the article you linked). It’s something we will consider and discuss internally, as a means to detect potential bots.

          This said, the crux of the problem is not that real visitors (human beings with browsers) are following these links but that bots are following them and hitting your page with bad traffic. In that case, the Reputator has a good chance to find and block them.

          As a quick experiment, I’ve looked for how much traffic we got last month from a Darodar referrer and the answer is none :)

          • Kenji Sano

            3 years ago

            Hi David,

            Now I have my landing page at Unbounce but the referral Darodar keeps showing. What can I do to block this referral?

            Thank you

          • David Dossot

            3 years ago

            At this point, the best would be to open a support ticket from within the Unbounce application. We will then have all the context necessary to help you best.
