Unplanned Downtime & Our Humble, Sincere Apology
This morning we experienced some unplanned downtime, which is highly unusual for us.
The short version of the story is that we made a mistake. We owe you, our customers, an apology, and we will most certainly work to ensure that it never happens again.
But if you’d like to better understand this morning’s outage and what we’re already doing to prevent these kinds of issues in the future, I first need to explain a bit about how our system works.
The Unbounce service is composed of several distinct components, but the two you care about and interact with are the public web application (app.unbounce.com) and published pages (available at unbouncepages.com and any custom domains you point at them). To keep both highly available, we operate them out of two data centers that can fail over for each other (Amazon’s cloud hosting in US West and US East). In the event of any systems issue, we’re able to cut either service over to either data center.
This gives us great flexibility when doing systems maintenance. When necessary, we can direct both our public web application and published page services to one data center and perform maintenance in the other. Typically this all happens behind the scenes without you noticing anything (we’ve never had to send a “we’ll be down for maintenance” email). This has allowed us to achieve (and surpass) our target availability of four nines (99.99%) for the public web application, and five nines (99.999%) for published pages.
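To put those targets in perspective, here is a minimal sketch of the arithmetic behind "four nines" and "five nines". It assumes availability is measured over a 365-day year, which is an assumption for illustration and not a statement of how we actually measure it; the function name is likewise hypothetical.

```python
# Assumption: availability is measured over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) that a given availability target allows."""
    return period_minutes * (1 - availability_percent / 100)


# Targets taken from the text above.
for label, target in [("public web application", 99.99),
                      ("published pages", 99.999)]:
    budget = downtime_budget_minutes(target)
    print(f"{label}: {target}% allows about {budget:.1f} minutes of downtime per year")
```

Under that assumption, four nines leaves room for a little under an hour of downtime per year, and five nines for only a few minutes.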
Over the last couple of days we’ve been using this capability to perform some much-needed database upgrades to accommodate our (and your) continued growth.
We had successfully cut over both services to our US East hosting and were preparing to perform the next step of our planned maintenance sequence in US West, and this is where we erred. The maintenance involved shutting down all remaining servers in US West; however, due to human error, we instead shut down all servers in US East.
Our monitoring kicked in immediately, alerting us to the problem, and the servers were restarted minutes later. The end result was 8 minutes of downtime for published pages, and 11 minutes of downtime for the public web application (it takes a couple of minutes longer for our web application to start).
While we design our overall system to provide redundancy in the face of any failure, it’s difficult to design around the human element. In this particular case, we had documented and discussed a detailed change plan, and the particular team member involved had properly communicated their intent to carry out the planned changes. They just made an error.
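For the tech-savvy folks, one general way to guard against this class of mistake is a pre-flight check that refuses to shut down a data center that is still serving live traffic. The sketch below is purely illustrative and is not a description of our tooling: the function name, the exception, the service names, and the region labels are all hypothetical.

```python
from typing import Dict


class UnsafeShutdownError(Exception):
    """Raised when a shutdown targets a region that is still serving traffic."""


def check_shutdown_is_safe(target_region: str,
                           active_region_by_service: Dict[str, str]) -> None:
    """Refuse to proceed if any service is currently routed to target_region.

    active_region_by_service is a hypothetical snapshot mapping each service
    name to the region it is being served from right now.
    """
    live = sorted(service
                  for service, region in active_region_by_service.items()
                  if region == target_region)
    if live:
        raise UnsafeShutdownError(
            f"Refusing to shut down {target_region}: still serving {', '.join(live)}"
        )


# Hypothetical state matching this morning: both services cut over to US East.
routing = {"web_app": "us-east", "published_pages": "us-east"}

check_shutdown_is_safe("us-west", routing)  # passes silently: nothing is live there

try:
    check_shutdown_is_safe("us-east", routing)
except UnsafeShutdownError as err:
    print(err)
```

A check like this only helps if the routing snapshot is accurate and the check cannot be skipped, so it complements a documented change plan without replacing it.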
Now, we already have significant system improvements in progress that would have mitigated this type of error: specifically, our next-generation page serving project (or Page Server 2.0, as we call it internally). From our very early days we have deployed our system on a limited number of servers, with multiple components residing on the same server. This is typical for small companies starting up new online services. We’ve long recognized that this is a weakness we’d need to address as we grow, and over the last few months we have been working on this next-generation architecture.
There are two specific features of this new architecture that would have improved things for us, and you, this morning. Firstly, Page Server 2.0 will be deployed in four data centers instead of two (likely in Ireland and Singapore in addition to US West and US East), and we’ll be able to automatically detect and route around failures in any of those data centers (for the tech-savvy folks in the crowd, we’ll be using Amazon’s Route 53 DNS and latency-based routing). With four data centers and automatic failover, a single error in any one data center would have affected only a portion of published page traffic, and for a much shorter period of time (likely under two minutes). Secondly, we’ll be able to run Page Server 2.0 on dedicated servers, rather than hosted together with our public web application as it is now. This would have further reduced the likelihood of a complete system shutdown this morning.
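The idea behind latency-based routing with automatic failover can be sketched in a few lines: send each visitor to the healthy data center with the lowest latency, and skip any data center that fails its health check. This is a conceptual illustration only, not Route 53's API and not our implementation; the function name, region labels, and latency figures are made up for the example.

```python
from typing import Dict, Optional, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if none is healthy."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical latencies (in milliseconds) as seen by one visitor.
latencies = {"us-west": 40.0, "us-east": 85.0, "ireland": 150.0, "singapore": 210.0}
all_regions = set(latencies)

# Normal operation: the visitor is served from the closest region.
print(pick_region(latencies, all_regions))

# If that region goes down, traffic moves to the next-closest healthy region.
print(pick_region(latencies, all_regions - {"us-west"}))
```

In this model, losing one data center only affects the visitors who were being routed to it, and only until the health check notices and they are sent elsewhere.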
We look forward to bringing these systems online this spring, as they will bring new features along with improved availability.
To all our customers who were affected this morning, please accept my most sincere and humble apology.
We understand how critical it is that we deliver maximum uptime to support your campaigns, and will increase our efforts to meet and exceed your availability expectations.
-Carl Schmidt, CTO