Shed that load or bust!
In September 2013, the Reactive Manifesto was published and, quickly, software developers around the world started to add their signatures to endorse it. Unsurprisingly we all want to build better software, that is applications that behave gracefully when things are normal and also when the fan gets hit. The four main principles listed in the manifesto (represented in the diagram below) are not new but they are conveniently bound together in a nice package, explained in details and christened with a name that is easy to remember (although somewhat confusing: it’s unrelated with reactive UI design).
Thirteen years ago a seminal paper from Welsh, Culler and Brewer introduced the notion of a Staged Event-Driven Architecture (aka SEDA) into which each sub-system (aka stage) of a system is decoupled from each other and connected with asynchronous event queues. The main benefit of this architecture (shown below) is that a system built that way can degrade gracefully under load, which is one of the qualities a reactive system should exhibit.
One drawback of this architecture is that it introduces latency between the different stages. In normal operations, this added latency can feel like a problem. It’s when things get hot that it becomes clear that the benefits of this architecture outweigh its drawbacks. Let’s take a look at what happened last Friday morning in our page serving infrastructure. The following graph shows the requests per minutes we were serving in one of our production regions:
The day was beginning with the normal traffic rise when suddenly things went crazy. Someone out there decided to offer us a free load test (thank you!) and check if our page serving infrastructure is truly web scale! For nearly an hour, we’ve received around ten times the traffic we were expecting at this time of day. This would be innocuous enough if we didn’t have, downstream of our page servers, the complex stats processing machinery we’re using to produce awesome traffic reports to our users.
Luckily for us, our page serving and stats processing infrastructures are decoupled from each other, as shown here:
Now let’s take a look at the number of messages in the message queue between the page servers and the stats processors:
As you can see, the number of messages pending processing surged quite high (high enough to trigger all sorts of alarms on our side – yes we monitor *all the things*) and it took a while for the stats processors to catch up and drain the queue. But the good news is these processors were at no time overloaded: they were chewing on messages at their standard pace, not pushing any extra load to the downstream systems and the database. Without this load shedding mechanism, the increased work load these processors would have faced could have potentially overwhelmed our stats infrastructure. But thanks to this design, load was contained to page serving and for the rest of our systems, it was business as usual.
Load shedding is one of the strategies we use at Unbounce to build scalable systems. We’ll keep sharing our experience on this subject, what’s working great and what’s working a little less great… In the meantime, feel free to share your experience with scaling your systems in the comments section!
Director of Software Architecture