Docker Storage Driver Performance Issues

Here at Grand Rounds, we’re pragmatists when it comes to choosing the technologies we work with on a daily basis. Often, that means using what’s tried-and-true, such as Ruby on Rails. But not always. Sometimes the state of the art can deliver win after win, even amongst the inevitable trials of using unproven software. Plus, the cutting edge is exciting and…actually, screw that. Stable systems are vastly superior to any of that nonsense.

“Honesty”, our custom CI system, was originally built on CoreOS. A year and many trials later, we’ve moved completely off the CoreOS ecosystem. We’re still using Docker, but now we’re on the very familiar Ubuntu, and using Docker Swarm for clustering. To document all of the weird problems we worked around or never quite got a handle on, while keeping the team’s builds moving along smoothly towards our Series C round and beyond, would be a sort of fishing tale: of interest mainly to those who were there, but pretty boring otherwise.

One struggle that popped up recently may be worth documenting, though. After flipping the switch on our new Ubuntu cluster, we started seeing significantly slower build times, on the order of an additional five minutes per build. Digging in, we found that the slowdown was coming from our parallelized RSpec processes starting up. Eventually, we figured out that the additional time was due to the default Docker --storage-driver differing between the two distributions: Ubuntu was using aufs, and CoreOS was using overlay. With aufs, the parallel RSpec processes appeared to load their respective files serially, each taking about 15 seconds; at 24 processes, that accounts for the roughly six-minute slowdown. With overlay, all 24 processes loaded in the expected total time of ~15 seconds. After re-provisioning our Ubuntu instances to use --storage-driver=overlay, we started seeing normal build times again. Huge relief.
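A quick way to check which driver a given daemon is actually running (and the first thing worth looking at if you suspect a similar mismatch) is docker info; the output below is illustrative:

$ docker info | grep 'Storage Driver'
Storage Driver: aufs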

In The Flesh

We’ll probably want a multi-core system to reproduce this behavior, so I spun up an m4.2xlarge EC2 instance from the official Ubuntu 14.04 AMI, with a 50GB EBS volume attached as /dev/sdb. Let’s get on the machine:

me@localhost$ ssh ubuntu@<instance_ip>
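As an aside, launching the instance itself can be scripted too. A rough AWS CLI sketch, where the AMI ID, key pair name, and region are placeholders to fill in yourself:

me@localhost$ aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type m4.2xlarge \
    --key-name <your_key_pair> \
    --block-device-mappings '[{"DeviceName":"/dev/sdb","Ebs":{"VolumeSize":50}}]'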

To run the overlay storage driver, we’ll need at least Linux kernel 3.18. Install it:

ubuntu@ec2-instance$ sudo apt-get update && sudo apt-get install -y linux-generic-lts-vivid && sudo shutdown -r now
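After the reboot, it's worth double-checking that the new kernel actually booted and that the overlay filesystem is available (the exact version string will vary):

ubuntu@ec2-instance$ uname -r
3.19.0-xx-generic
ubuntu@ec2-instance$ sudo modprobe overlay && grep overlay /proc/filesystems
nodev   overlay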

Get back on the train, and install Docker:

ubuntu@ec2-instance$ sudo -i

# paste all these lines at once:
apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo 'deb https://apt.dockerproject.org/repo ubuntu-trusty main' > /etc/apt/sources.list.d/docker.list
apt-get update -y
apt-get purge -y lxc-docker*
apt-get install -y docker-engine
usermod -a -G docker ubuntu
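Before logging out, a quick sanity check that the engine actually installed; the version reported will depend on what the repo is shipping at the time:

root@ec2-instance# docker --version
Docker version 1.x.x, build xxxxxxx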

Log out and back in again to pick up the group change, and do some additional setup:

ubuntu@ec2-instance$ sudo -i

# paste all these lines at once:
service docker stop  # we'll be running our own docker daemon
mkfs.ext4 /dev/sdb
mount /dev/sdb /var/lib/docker
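A quick df confirms the fresh EBS volume is now backing Docker's data directory (sizes here are illustrative):

root@ec2-instance# df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         50G   52M   47G   1% /var/lib/docker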

Run the docker daemon:

ubuntu@ec2-instance$ sudo -i

# paste all these lines at once:
rm -rf /var/lib/docker
docker daemon -D --storage-driver=aufs
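Note that this runs the daemon in the foreground (-D just turns on debug logging), so everything that follows happens in a second SSH session. Wiping /var/lib/docker between runs keeps the comparison honest; if you'd rather keep both image caches around instead, one option is to give each daemon its own data root via the -g flag, along these lines:

root@ec2-instance# docker daemon -D --storage-driver=overlay -g /var/lib/docker-overlay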

Instead of the Grand Rounds secret sauce, we’ll be using discourse as an example project. The load time for its Ruby runtime is large enough to represent our problem well.

I’ve stuck all the discourse setup into a docker image. Run the datastores required for our test along with the discourse container:

ubuntu@ec2-instance$ docker run -d --name=redis --net=host redis:3.0
ubuntu@ec2-instance$ docker run -d --name=pg --net=host postgres:9.4
ubuntu@ec2-instance$ docker run -it --rm --name=storage-driver-test --net=host -w /root/discourse grnds/docker-storage-driver-test bash
root@ec2-instance:~/discourse# . ../setup.sh  # now we're in the container
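If that setup script hangs, it's worth confirming the datastores actually came up before blaming the test image; from another shell on the host:

ubuntu@ec2-instance$ docker ps
ubuntu@ec2-instance$ docker logs pg 2>&1 | tail -n 3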

We’re ready to test our parallel problems, but let’s get a baseline figure first:

# Ctrl-C the rspec command once it starts printing dots, because we only care about the load time
root@ec2-instance:~/discourse# rspec
...^C
Finished in 6.68 seconds (files took 5.3 seconds to load)
...

The “files took 5.3 seconds to load” is our baseline. Let’s run the aufs test:

# this will print out lots of stuff, and we don't care about most of it, but
# we can't kill the process early or the information we want won't get printed.
root@ec2-instance:~/discourse# parallel_rspec ./spec
8 processes for 330 specs, ~ 41 specs per process
...
Finished in 3 minutes 16.2 seconds (files took 25.16 seconds to load)
560 examples, 13 failures

Failed examples:
...

Eight rspec processes ran, and “files took 25.16 seconds to load”. That’s not quite (baseline x 8), but it’s much higher than expected. How does overlay compare?

We’ll have to re-do our setup, but dockerization ends up saving us quite a bit of headache:

# kill the aufs docker daemon from before.
# this is exactly the same as above, other than the --storage-driver.
ubuntu@ec2-instance$ sudo -i
root@ec2-instance# rm -rf /var/lib/docker
root@ec2-instance# docker daemon -D --storage-driver=overlay

ubuntu@ec2-instance$ docker run -d --name=redis --net=host redis:3.0
ubuntu@ec2-instance$ docker run -d --name=pg --net=host postgres:9.4
ubuntu@ec2-instance$ docker run -it --rm --name=storage-driver-test --net=host -w /root/discourse grnds/docker-storage-driver-test bash
root@ec2-instance:~/discourse# . ../setup.sh  # now we're in the container

root@ec2-instance:~/discourse# parallel_rspec ./spec
8 processes for 330 specs, ~ 41 specs per process
...
Finished in 1 minute 6.27 seconds (files took 6.04 seconds to load)
592 examples, 21 failures

Failed examples:
...

Here, the eight rspec processes took 6.04 seconds to load their files. It’s very close to the baseline.

Conclusion

This analysis certainly has confounders, but it captures the pattern of poor performance we were seeing in production, on nearly the same stack, and in an easily reproducible way.

I don’t think we necessarily learned any profound lessons from this. It mainly served to reinforce the fact that you can’t plan for everything when you make big changes, and that an in-depth knowledge of your system and good troubleshooting skills are the only things that can save you from yourself.