What The Page Builder Team Learned From Our First Fire Drill
We’ve started running “product fire drills” at Unbounce to help us react better as a team when things go wrong (as they do from time to time when you build products).
On July 7th, we held our first such drill and thought it would be a great idea to share our experience and learning points.
First, I’d like to go over how we planned to set this up. After each sprint (we run 2-week sprints), we hold a retrospective to talk about what went well and what can be improved. During a recent retro, we thought it would be a good idea to go through the steps of a fire drill scenario to help us prepare. We decided that once per sprint while Alex was testing, he would pick a bug that he finds and call a fire.
Here’s how the first drill broke down.
Paul K – The Captain.
Mark W – Customer Success Rep.
Suman/Slan/Johnny – Fire fighters.
Matt/Tavis – Observers, coaches.
Alex – The Fire Starter.
Captain’s Log: Post all data to the Captain’s Log for review during the post-mortem
Battle Bridge HipChat Room: Cobra Fire Drill
Start Time: 3:11pm July 8th
End Time: 4:01pm July 8th
We sent three emails, all prefixed with “fire drill”, to our development team.
At 3:11 a fire alarm was pulled. Customer Success stated that some customers were complaining they could not save their changes in the landing page builder. Repro steps were not available.
Note: Things started off a little rough. Even though we had talked about doing this drill, we never really planned how we should do it, and figuring out who would do what ate up a little more time than we expected.
We split up into our roles and opened up our Incident Management Wiki page.
Suman and I started debugging within the app and Johnny started looking through our editor error logs. At this point we still weren’t sure of the severity of the issue.
Note: When we started, we talked about first finding the severity of the issue so we could post it into the battle bridge. Although we continued trying to find the repro steps, we failed to post the severity into the battle bridge. CS posted a link to an edit page into the battle bridge which, as it turned out, contained the problem.
While Suman and I continued digging in, Tavis suggested to Johnny that he use our feature flagging tool to determine which feature gates had been opened recently, as a way to uncover what was causing the issue.
Learning Point 1: Cross-Training Is Important For Success
Team-wide knowledge of our tool base is a necessity, so a primer needs to be scheduled to make sure we all know how to use our full tool base. I, for one, did not know we could use our feature flagging tool to determine which gates had been opened recently.
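To make Tavis's suggestion concrete, here is a minimal sketch of the idea of narrowing down suspects by looking at which feature gates were opened recently. It does not show our actual feature flagging tool: the `FeatureGate` record, the `recently_opened` helper, the gate names, and the dates are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FeatureGate:
    """Hypothetical record of a feature gate and when it was last opened."""
    name: str
    opened_at: datetime


def recently_opened(gates, now, window=timedelta(days=7)):
    """Return the gates opened within `window` before `now`, newest first."""
    cutoff = now - window
    recent = [g for g in gates if cutoff <= g.opened_at <= now]
    return sorted(recent, key=lambda g: g.opened_at, reverse=True)


# Made-up gates and dates, for illustration only.
gates = [
    FeatureGate("new-save-button", datetime(2000, 1, 9, 10, 0)),
    FeatureGate("old-experiment", datetime(1999, 11, 1, 9, 0)),
    FeatureGate("editor-toolbar", datetime(2000, 1, 6, 15, 30)),
]
now = datetime(2000, 1, 10, 15, 11)

suspects = [g.name for g in recently_opened(gates, now)]
print(suspects)
```

During an incident, the newest gates at the top of that list would be the first ones worth checking against the problem area.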
Note: Having a captain is crucial to keep everyone focused: it can be difficult to keep track of findings in the moment when there’s a lot of chatter happening both verbally and in HipChat. It seemed to take (I felt) a long time to find the repro steps, and it was frustrating not really being able to tell what the severity level should have been. It was also hard to switch from a train of thought to HipChat and back to gather findings and clues.
Suman, on a hunch (he had worked on some code that touched the problem area), found the repro steps and posted his finding into HipChat. I found it a little frustrating that even though I had the steps to reproduce, I still could not reproduce the problem. It turned out that I had been pressing ctrl-s to save instead of clicking the save button.
Learning Point 2: It Boils Down To Details and Communication
We need to be as specific as possible when posting any findings into HipChat.
CS told us that the issue was actually pretty widespread and that many customers were complaining that they couldn’t save. With the repro steps now known, we determined that the issue was severe enough to warrant a roll-back.
Learning Point 3: Clarification Matters
We have two ways we can roll back – we can do a code roll back or we can turn off a feature gate. We declared a roll back but it was not clear which of the two we meant.
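One way to avoid that ambiguity is to make the kind of roll back part of the announcement itself. The sketch below is only an illustration of that idea, not something we have in place: `RollbackKind`, `rollback_announcement`, the message wording, and the gate name are all hypothetical.

```python
from enum import Enum


class RollbackKind(Enum):
    """The two roll back options described above."""
    CODE = "code"
    FEATURE_GATE = "feature gate"


def rollback_announcement(kind, target):
    """Build a chat message that names the kind of roll back explicitly."""
    if not isinstance(kind, RollbackKind):
        raise TypeError("kind must be a RollbackKind")
    if kind is RollbackKind.CODE:
        return f"ROLL BACK (code): reverting to release {target}"
    return f"ROLL BACK (feature gate): turning off gate {target}"


# Hypothetical gate name, for illustration only.
message = rollback_announcement(RollbackKind.FEATURE_GATE, "new-save-button")
print(message)
```

Forcing whoever declares the roll back to pick one of the two kinds means nobody in the room has to guess which one was meant.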
We told CS that we had found the issue and would turn off the feature gate. We also told them that pressing ctrl-s or cmd-s would force the page to save. That workaround mattered because, even after the roll back, pages that had been opened before it would still have the issue unless the keyboard shortcut was used to save them.
Learning Point 4: The Job Isn’t Done Until It’s Done
Even though we found the issue and resolved it with a roll back, our job was not done until we fully understood the impact to all our customers.
After the drill, we had a post-mortem and chatted about our learning points.
What we’ll do next time:
The first time, as expected, was a little chaotic. We hope that by continuing to practice preparedness, we’ll become more relaxed and the communication process will be smoother. A few other things we’re going to try:
1. We’ll have someone play the role of a customer so we can understand how this might impact them.
2. We’ll have a couple of the fire fighters simulate being remote by sitting somewhere else in the office.
For now, we’ve set a goal of hosting one fire drill every sprint.
What does your team do? Are you running fire drills or other troubleshooting scenarios? We’d love to hear how you’re approaching things in the comments below.
Senior Developer & Page Builder Lead