Recipe for a Perfect Storm
TL;DR: Last Friday, the response time for serving pages to Europe-based visitors climbed to unacceptable levels (15 seconds per page or even more), forcing us to temporarily redirect all European traffic to our page serving infrastructure on the US East Coast. A combination of elevated response times from S3 and illegitimate traffic hitting our pages exposed an implementation weakness in our page server and, eventually, led to this perfect storm. We’re really sorry for this inconvenience and want to let you know we are working hard to prevent this from happening again.
In the rest of this post, I’ll detail some of the actions we’re taking to alleviate the issue. As you’ll see, we had a few blind spots in different parts of our systems and are dealing with them actively.
Simulate Badness
Anyone running a web server in the open on the Internet nowadays can testify that it is a wild world out there and nasty things happen all the time. The same of course happens to us: going through our access logs, we can see that bots are constantly attempting to forge malicious requests towards the pages we are hosting. These bots, of course, don’t take no for an answer: it doesn’t matter to them if a page returns 404, they’ll keep hitting it like crazy. Though we have filtering and caching in place, any request that hits our page servers has a cost. We perform regular load tests and thus have a pretty good idea of the performance profile of our page servers.
But we had a blind spot: up to now, we didn’t include in our load tests the kind of random crap bots throw at us. We’ve started to close this gap and have already made some very interesting discoveries. For example, here is what we’ve noticed with VisualVM, after adding random 404s to our load test:
These red squares (not the orange ones, I know it’s hard to tell) are our HTTP processing threads being blocked. This is plain nastiness. A thread dump allowed us to identify the issue deep in the bowels of the JDK XML support. Since this error comes through a call to Amazon’s S3 client, we reported the issue to them, and we’ve been able to put a fix in place that leads to a much better threading profile:
Nasty thread blocking begone! Now we’re able to process requests in parallel with fewer hurdles. Speaking of parallel processing, another ingredient of this perfect storm was related to asynchronous processing of requests.
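To make the load-test gap above more concrete, here is a minimal sketch of how a request mix with bot-like random 404s could be generated. It is written in Python purely for illustration and is not our actual load-testing code: the paths in `KNOWN_PATHS`, the function names and the `junk_ratio` parameter are all hypothetical.

```python
import random
import string

# Hypothetical paths standing in for real published pages (assumption).
KNOWN_PATHS = ["/", "/pricing", "/signup", "/landing/spring-sale"]


def random_junk_path(rng, max_segments=3):
    """Build a random path that is not a known page, i.e. a likely 404."""
    alphabet = string.ascii_lowercase + string.digits
    while True:
        segments = [
            "".join(rng.choice(alphabet) for _ in range(rng.randint(4, 12)))
            for _ in range(rng.randint(1, max_segments))
        ]
        path = "/" + "/".join(segments)
        if path not in KNOWN_PATHS:  # guard against an accidental real path
            return path


def build_request_mix(total, junk_ratio, seed=0):
    """Return `total` paths, a `junk_ratio` share of which are random junk."""
    rng = random.Random(seed)  # seeded so a load test run can be repeated
    junk_count = round(total * junk_ratio)
    paths = [rng.choice(KNOWN_PATHS) for _ in range(total - junk_count)]
    paths += [random_junk_path(rng) for _ in range(junk_count)]
    rng.shuffle(paths)  # interleave junk with legitimate traffic
    return paths
```

Calling `build_request_mix(1000, 0.3)` would yield 1000 paths, 300 of them junk, ready to be fed to whatever tool drives the load test.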
Embrace Asynchrony
At Unbounce, we’re big on building reactive applications. We’re evolving our systems architecture and infrastructure towards more asynchronous processing wherever we can. But, again, we had a blind spot… Take a look at the following diagram showing the state of our hosts in Europe, at the time of the incident:
Our servers were flapping in and out of service like crazy. Why was that? Simply because the handler used to process health check requests from the load balancer (ELB) was also used to deal with page requests. Since our page request handler was overwhelmed by nasty traffic and was dealing with S3 being slower than usual, the health check calls started to be affected and took too long to respond. The ELB did what all good load balancers do when a host doesn’t respond in time to its health check: it took it out of its pool of active back-ends. Then it tried to put it back in, kept it up for a while and then pulled it out again, as the health check would fail once more. Hence the flapping.
We’ve solved the issue by changing how page servers deal with incoming requests, as illustrated in the following diagram:
Any request handled by a processor that is not I/O bound is served synchronously; all other requests are served asynchronously. We expect this change to give us cleaner degradation under adverse conditions. Even when they are unable to serve pages, for example if S3 is having issues, page server instances will remain healthy and will produce easier-to-understand errors, like HTTP 503s. A corollary of this change is that we will need to monitor our page servers better.
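The split described above can be sketched roughly as follows. This is an illustration in Python, not our page server code: the `PageServer` class, the `/health` path, the `fetch_page` callable (standing in for an I/O-bound call such as an S3 read) and the `max_in_flight` limit are all assumptions made for the example.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def _completed(result):
    """Wrap an already-known result in a Future for a uniform return type."""
    future = Future()
    future.set_result(result)
    return future


class PageServer:
    def __init__(self, fetch_page, max_in_flight=4):
        self._fetch_page = fetch_page  # hypothetical I/O-bound call
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, path):
        """Return a Future resolving to a (status, body) tuple."""
        if path == "/health":
            # Not I/O bound: answered synchronously, never queued behind pages.
            return _completed((200, "OK"))
        if not self._slots.acquire(blocking=False):
            # Saturated: fail fast with a clear error instead of piling up.
            return _completed((503, "Service Unavailable"))
        future = self._pool.submit(self._serve_page, path)
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def _serve_page(self, path):
        try:
            return 200, self._fetch_page(path)
        except Exception:
            # The backing store failed: degrade to a 503, stay healthy.
            return 503, "Service Unavailable"

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

With this shape, a slow or failing backing store only ever consumes the bounded page pool. Health checks keep answering right away, so the load balancer has no reason to pull the instance out of service, and visitors get a prompt 503 rather than a 15-second wait.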
Know Thyself
Speaking of monitoring, another issue that was revealed during the storm is our lack of proper alerting on page response time. We are crazy about monitoring: we have dashboards that allow us to peek deep into our systems and have related alarms when things go wrong. But, again, we had a blind spot on the particular aspect of page response time. We’ve already remedied this problem and will now be alerted to this type of issue. We’re also working on increasing our capacity to deal faster with such issues.
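As an illustration only, an alarm on page response time could look something like the sketch below, which fires when a high percentile of recent response times crosses a threshold. The `ResponseTimeAlarm` class, the window size, the 95th percentile and the threshold value are assumptions chosen for the example, not a description of our actual alerting setup.

```python
import math
from collections import deque


class ResponseTimeAlarm:
    """Fire when a percentile of recent response times exceeds a threshold."""

    def __init__(self, threshold_seconds, window_size=100, percentile=95):
        self.threshold_seconds = threshold_seconds
        self.percentile = percentile
        # Sliding window: the oldest sample drops out once the window is full.
        self._samples = deque(maxlen=window_size)

    def record(self, seconds):
        self._samples.append(seconds)

    def current_percentile(self):
        """Nearest-rank percentile of the window, or None if it is empty."""
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = math.ceil(self.percentile * len(ordered) / 100)
        return ordered[max(rank, 1) - 1]

    def is_firing(self):
        value = self.current_percentile()
        return value is not None and value > self.threshold_seconds
```

Watching a percentile rather than the average means a minority of very slow page loads is enough to raise the alarm, instead of being drowned out by fast cached responses.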
Wheat from Chaff
Finally, we’ve launched a new initiative intended to better protect our pages, and therefore their related statistics, from the nasty traffic that constantly hits us. Our goal is to build a system smart and autonomous enough to sort the bad traffic from the legitimate traffic as requests come in. We’re on the case and, as always, are committed to making this awesome so, please, stay tuned and look for future announcements on this matter…
As always, it’s only when the fan gets hit that the unknown unknowns get known (you may want to re-read this sentence!). We’ve learned a lot during this storm and have taken numerous actions to make things better. Thanks for your patience, the future is brighter, we promise :)