Here at Grand Rounds, we’re pragmatists when it comes to choosing the technologies we work with on a daily basis. Often, that means using what’s tried-and-true, such as Ruby on Rails. But not always. Sometimes the state of the art can deliver win after win, even amongst the inevitable trials of using unproven software. Plus, the cutting edge is exciting and…actually, screw that. Stable systems are vastly superior to any of that nonsense.
“Honesty”, our custom CI system, was originally built on CoreOS. A year and many trials later, we’ve moved completely off the CoreOS ecosystem. We’re still using Docker, but now we’re on the very familiar Ubuntu, and using Docker Swarm for clustering. To document all of the weird problems we worked around or never quite got a handle on, while keeping the team’s builds moving along smoothly towards our Series C round and beyond, would be a sort of fishing tale: of interest mainly to those who were there, but pretty boring otherwise.
One struggle that popped up recently may be helpful to document though. After
flipping the switch on our new Ubuntu cluster, we started seeing significantly
slower build times, on the order of an additional five minutes. Digging in, we
found that the slowdown was coming from our parallelized RSpec processes
starting up. Eventually, we figured out that the additional time was due to the
--storage-driver being different between Ubuntu and CoreOS.
Ubuntu was using aufs, and CoreOS overlay. Under aufs, it appeared that
the parallel RSpec processes were loading their respective files in serial,
each taking about 15 seconds. At 24 processes, that accounts for the
six-minute slowdown. With overlay, all 24 processes loaded in the expected total
time of ~15 seconds. After re-provisioning our Ubuntu instances to use
--storage-driver=overlay, we started to see expected build times again. Huge
win.
In The Flesh
We’ll want a multi-core system to reproduce this behavior, so I spun up an
m4.2xlarge EC2 instance from the official Ubuntu 14.04 AMI, with a 50GB EBS
volume attached as /dev/sdb. Let’s get on the machine:
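Something along these lines, with the key pair and hostname standing in for your own:

```shell
# Key file and hostname are placeholders for your own instance
ssh -i ~/.ssh/my-key.pem ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
```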
To run the
overlay storage driver, we’ll need at least the 3.18 Linux Kernel. Install:
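A sketch of one way to do this on 14.04, assuming the vivid hardware-enablement kernel (3.19) is acceptable:

```shell
# 14.04 ships a 3.13 kernel; the lts-vivid enablement stack provides 3.19,
# new enough for the overlay driver's 3.18 requirement
sudo apt-get update
sudo apt-get install -y linux-generic-lts-vivid
sudo reboot
```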
Get back on the machine after the reboot, and install Docker:
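The exact commands didn’t survive here; a minimal sketch, assuming the official convenience script (the group change it makes is why we log out next):

```shell
# Official convenience script; installs the docker engine and service
curl -sSL https://get.docker.com/ | sh
# Allow the ubuntu user to talk to the docker daemon without sudo
sudo usermod -aG docker ubuntu
```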
Log out and back in again to pick up the group change, and do some additional setup:
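What the setup needs to accomplish is putting Docker’s data directory on the EBS volume, so the storage driver has room to work. A sketch, assuming the volume surfaces as /dev/xvdb (EC2 renames /dev/sdb on these instance types):

```shell
# Format the EBS volume and mount it as Docker's data root
sudo service docker stop
sudo mkfs.ext4 /dev/xvdb
sudo mkdir -p /var/lib/docker
sudo mount /dev/xvdb /var/lib/docker
```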
Run the docker daemon:
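To reproduce the slow case we want aufs, Ubuntu’s default at the time. Something like the following, run in its own terminal (the `docker daemon` form is Docker 1.8+; older releases used `docker -d`):

```shell
# Run the daemon in the foreground with the aufs storage driver
sudo service docker stop
sudo docker daemon --storage-driver=aufs
```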
Instead of the Grand Rounds secret sauce, we’ll be using discourse as an example project. The load time for its Ruby runtime is large enough to represent our problem well.
I’ve stuck all the discourse setup into a docker image. Run the datastores required for our test along with the discourse container:
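Roughly like so — the discourse image name below is a placeholder for the prebuilt image mentioned above, and the datastores are the stock Postgres and Redis images discourse needs:

```shell
docker run -d --name postgres postgres:9.4
docker run -d --name redis redis:2.8
# Placeholder image name; link in the datastores and keep the container alive
docker run -d --name discourse --link postgres:postgres --link redis:redis \
  example/discourse-test tail -f /dev/null
```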
We’re ready to test our parallel problems, but let’s get a baseline figure first:
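A single rspec process against one spec file; the spec file name here is illustrative:

```shell
# One process, one spec file, to measure load time on its own
docker exec discourse bundle exec rspec spec/models/topic_spec.rb
# reports: "files took 5.3 seconds to load"
```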
The “files took 5.3 seconds to load” is our baseline. Let’s run the
same spec with eight parallel rspec processes:
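A sketch of the parallel run, fanning out eight processes with a shell loop:

```shell
# Eight concurrent rspec processes, each loading its own Ruby runtime
for i in $(seq 1 8); do
  docker exec discourse bundle exec rspec spec/models/topic_spec.rb &
done
wait
```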
Eight rspec processes have run, and now “files took 25.16 seconds to load”.
It’s clearly not (baseline × 8), but it’s much higher than expected. And it
scales with process count: at the 24 processes our production builds use, that
load time turns into minutes.
We’ll have to re-do our setup, but dockerization ends up saving us quite a bit of headache:
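A sketch of the re-run, assuming the same container names and images as before; the only meaningful change is the daemon’s storage driver:

```shell
# Tear everything down and clear the aufs state
docker rm -f discourse postgres redis
sudo rm -rf /var/lib/docker/aufs
# Restart the daemon with the overlay driver (in its own terminal)
sudo docker daemon --storage-driver=overlay
# Then, back in the first terminal: re-create the postgres, redis, and
# discourse containers exactly as before, and repeat the parallel loop
for i in $(seq 1 8); do
  docker exec discourse bundle exec rspec spec/models/topic_spec.rb &
done
wait
```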
Here, the eight rspec processes took 6.04 seconds to load their files. It’s very close to the baseline.
This analysis certainly has confounders, but it captures the pattern of poor performance we were seeing in production, on nearly the same stack, in an easily reproducible way.
I don’t think we necessarily learned any profound lessons from this. It mainly served to reinforce the fact that you can’t plan for everything when you make big changes, and that an in-depth knowledge of your system and good troubleshooting skills are the only things that can save you from yourself.