Better logging of startup failures

Something is going to break

Failures happen. Especially in complex systems with lots of dependencies. We try our best to design our systems with that in mind.

We run a lot of short-lived instances on AWS, and when something goes wrong those instances can be very short-lived indeed. If a new instance fails to start up correctly, whether because of a transient failure outside the instance or a broken code deployment, AWS auto-scaling terminates it and starts a replacement. This usually happens quietly, before we even know there was something wrong.

A lot of the time this is great. If one server in a group has a problem, then throwing out the instance and replacing it is the correct move. The main drawback is that diagnosing bugs becomes difficult: it’s hard to do standard sysadmin debugging when the system is no longer around to admin!

We use SumoLogic for centralized logging on all our AWS instances. Sumo collects and ships logs with an agent (written in Scala) on each server. We manage the agent “collector” with a configuration file created by an Ansible playbook which is run during instance startup.

Until recently we were configuring logging to Sumo near the end of the startup process. We did this because the logging needed to be tailored to the type of service and the environment the instance was running in, e.g. logging to production collectors only on production instances.
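As a rough illustration (the variable and category names here are hypothetical, not our actual configuration), the environment-specific part of that setup boils down to something like:

# Hypothetical sketch: pick a Sumo source category based on the environment
# the instance runs in, so production logs only go to production collectors.
case "$ENVIRONMENT" in
  production) sumo_category="production/app" ;;
  staging)    sumo_category="staging/app" ;;
  *)          sumo_category="development/app" ;;
esac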

The major drawback is that configuring logging so late prevents us from capturing problems that happen early in the instance lifetime. (Fun bonus question: How do you log the failure of the logging system?)

To help ourselves out we’ve made a couple of changes to the way we set up our instances.

Ship logs earlier

To start shipping logs earlier, we now build server images with Sumo preconfigured to watch a set of log files common to all the instances we start, especially the log files we write during application deployment at instance startup. We still run the Ansible role later in startup to set up the correct log files for that particular instance.
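As a sketch of what that baked-in configuration might look like (the file paths, source names, and category are illustrative, and the collector’s sources file location may differ in your install), the image build writes a default set of local file sources for the collector:

# Illustrative only: bake a default Sumo source list into the image so common
# logs (e.g. cloud-init output) are shipped from first boot.
cat > /opt/SumoCollector/config/sources.json <<'EOF'
{
  "api.version": "v1",
  "sources": [
    {
      "sourceType": "LocalFile",
      "name": "cloud-init-output",
      "pathExpression": "/var/log/cloud-init-output.log",
      "category": "instance-startup"
    }
  ]
}
EOF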

Alternate shipper

We’re also adding failure trapping to some of our scripts in case things go really bad. The function below fires if any of our scripts fail spectacularly during startup. It dumps the last 256KB of the specified log file into an SNS notification that goes to the entire SysOps group. (Note: 256KB is currently the maximum message size for an SNS payload.)

function finish {
  # Identify which server failed so the notification is actionable
  instance_id="$(ec2metadata --instance-id)"
  # SNS messages max out at 256KB, so only ship the tail of the log
  message="[Instance Startup Failure] $instance_id: $(tail -c -256000 "$LOGFILE")"

  aws sns publish --topic-arn "$notification_topic" \
                  --message "$message" \
                  --region "$aws_region"
}
trap finish ERR
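For context, here’s a minimal sketch of how this might be wired into a startup script; the log path, topic ARN, and deployment command are placeholders, not our real values. One gotcha worth noting: bash only fires the ERR trap for failures inside shell functions if errtrace is enabled.

#!/usr/bin/env bash
set -euo pipefail
set -o errtrace   # make the ERR trap fire for failures inside functions too

# Placeholder values; in practice these come from instance metadata / user data
LOGFILE=/var/log/instance-startup.log
notification_topic="arn:aws:sns:us-east-1:ACCOUNT_ID:instance-startup-failures"
aws_region="us-east-1"

# ... define finish() and set the trap as above ...

# From here on, any command that exits non-zero triggers finish(), which
# publishes the tail of $LOGFILE to the SysOps SNS topic before auto-scaling
# terminates the instance.
./deploy-application.sh >> "$LOGFILE" 2>&1   # placeholder deployment step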

Future Improvements

There’s still lots of work to be done improving our instance startup speed and stability. Knowing what’s going on at all stages of the instance lifecycle is the first step!

Mike Thorpe
Technical Operations Manager