Docker is a popular product for running and distributing containerized applications, and Dockerfiles are a common way of constructing those containers. Unfortunately, Dockerfiles and the underlying assumptions they base their philosophy on are anti-patterns for exactly the kind of use cases Docker wants to solve.
There are some obvious issues with running third-party Dockerfiles. Like most of the Docker ecosystem, Dockerfiles were designed for personal use by an individual with root access. Once you start distributing them, however, you’re essentially giving root to a stranger. This blog post is about why you shouldn’t even be using Dockerfiles for your own projects.
It’s a pretty common belief that testing your code and products is a good thing. (If you don’t agree, you may want to stop reading this blog post and go… I don’t know, somewhere else where there aren’t computers.) Dockerfiles are a terrible place to test your product.
They weren’t designed for testing, they use a vastly different code path from the actual running of a container, and they are a bit of a bolt-on from the days before tools existed to work with and instrument Docker containers.
Do you really want your test dependencies in your container? If you don’t want those layers floating around, you better work on your bash-fu, and make sure you can install, run, and delete your tests in one terrible-to-read line of shell script. Same goes for your build dependencies.
Perhaps what you want to do is have a separate Dockerfile for testing that imports the container you made with the first Dockerfile (which probably has your build dependencies in it), and then run the tests there and… what? Mark the first container you built as having failed tests? Delete it?
A significant number of people push Dockerfile changes, let some poor registry build it, and then import that result to run their secondary testing Dockerfile. Now they’ve just shipped before they’ve tested.
In all these cases the Dockerfile is doing very little for you that a bash script wouldn’t be doing in a better, more predictable way, and you probably had to write a bash script to manage the process anyway. You want to be building and testing your product and designing a minimal runtime environment (read: container) for it, not working around the semantics of Dockerfiles and their associated layers.
Layer Storage is Fallible
Your carefully crafted layers are nearly useless. Sorry. In fact, up until the end of last month they were actually causing lots of problems, and it isn’t clear yet that the issue is fundamentally resolved. An approach that does work around these fundamental issues is a nice flat image. Or, ideally, being able to point at your filesystem directly, bypassing the docker “graph” altogether (I’m looking at you, runc).
Layer storage as implemented appears to have been designed as an optimization for desktop users, and offers decent qualities for that basic, manual case (i.e. a user directly creating and managing personal-use images on their own machine). But in any automated workflow (as in the datacenter), the minor disk space and network optimization you get from the layer system is not worth the trouble or the risk.
The Registry is Fallible
Following from the above, the next bishop in the church of layers is the registry.
- How do you make sure you get the same version of an image from the registry?
- What happens to your image distribution when your registry is down?
- What version of the registry are you running?
- What problem is the registry even solving for you?
- Why aren’t you storing tarballs on your own reliable storage solution and importing them?
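To make the first and last questions concrete, here is a minimal sketch of pinning an image tarball by checksum before you import it. The function names and the idea of keeping a pinned SHA-256 digest next to the tarball are assumptions for illustration, not something Docker or wercker prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_digest):
    """Return the tarball path, or raise ValueError if the digest differs.

    expected_digest is a hypothetical pinned value you recorded when the
    tarball was built (for example, stored alongside it).
    """
    actual = sha256_of_file(path)
    if actual != expected_digest.lower():
        raise ValueError(f"{path}: expected {expected_digest}, got {actual}")
    return Path(path)
```

Once the check passes, the tarball can be handed to docker import (or docker load, if it came from docker save). Whatever storage you already trust for backups is probably good enough to hold it.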
Dependencies Didn’t Go Away
It’s Tuesday. That doesn’t just mean new music (woops, they changed it to Friday?), it also means you’ve already spent at least one weekday this week sitting on an unpatched security vulnerability. Time to update those containers!
Not feeling so immutable now, are we? Your standard workflow needs to involve regenerating containers constantly. You are going to have to do it to keep things up to date, so you better get used to it and get good at it.
Thinking of containers or layers as “immutable” has got it backwards, anyway. The term “immutable infrastructure” gets passed around a bunch now, but folks seem to miss the “infrastructure” part. The container was never the atom of immutability; that is the role of the server, the “infrastructure.” The container’s role is distribution and encapsulation and being a bit faster than a VM.
Layer Overlap is Iffy at Best
To top it all off, the actual overlap of common layers is frequently quite small. To test this out I wrote up a quick layer auditing tool (it seems like Docker has been removing theirs; the tools I previously used appear to be deprecated now). Introducing dlayer! You can run it locally and see what kinds of stats you get.
For my local machine, I had negligible savings and almost no overlap:
Tag nodesource/trusty:latest : 12 layers - 356MB (virtual)
Tag mongo:latest : 17 layers - 244MB (virtual)
Tag tcnksm/gox:1.4.1 : 13 layers - 1702MB (virtual)
Tag tcnksm/gox:latest : 13 layers - 1704MB (virtual)
Tag ubuntu:latest : 4 layers - 179MB (virtual)
Tag nodesource/node:trusty : 12 layers - 386MB (virtual)
Tag google/python:latest : 6 layers - 362MB (virtual)
Tag tcnksm/gox:1.4.2 : 13 layers - 1704MB (virtual)
Tag google/golang:latest : 9 layers - 583MB (virtual)
Tag redis:latest : 17 layers - 104MB (virtual)
Tag golang:latest : 14 layers - 493MB (virtual)
Tag nginx:latest : 12 layers - 126MB (virtual)
Tag busybox:latest : 3 layers - 2MB (virtual)
Tag phusion/passenger-ruby22:latest : 16 layers - 635MB (virtual)
Total : 163 layers - 8832MB (actual)
Reachable: 149 layers - 8339MB (actual) 8585MB (virtual)
Shared : 5 layers - 165MB (actual)
I also ran it on one of our servers, where we had a bit more overlap and disk savings across about 230 tagged images. Much of the overlap comes from multiple releases of programming language base boxes (20 ruby, 14 node, 8 python on this particular machine):
Total : 2554 layers - 88132MB (actual)
Reachable: 2107 layers - 77631MB (actual) 160832MB (virtual)
Shared : 585 layers - 21842MB (actual)
Overall, the layering saves about half the disk space of just having a flat image, a price I’d be very happy to pay for the assurance of not having to deal with the Docker graph or registry.
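If you want to check the arithmetic on your own machine, here is a small sketch. It assumes the "virtual" figure on the Reachable line is what the same images would occupy as flat images and the "actual" figure is what the shared layers really take on disk; the function name is made up for this example.

```python
def layer_savings(actual_mb, virtual_mb):
    """Fraction of disk saved by sharing layers versus storing flat images.

    actual_mb:  size on disk with shared layers (assumed meaning of "actual")
    virtual_mb: size if every image were stored flat (assumed meaning of "virtual")
    """
    if virtual_mb <= 0:
        raise ValueError("virtual_mb must be positive")
    return 1 - actual_mb / virtual_mb


# "Reachable" figures quoted above, as (actual MB, virtual MB).
local_machine = layer_savings(8339, 8585)
server = layer_savings(77631, 160832)

print(f"local machine: {local_machine:.1%} saved")
print(f"server: {server:.1%} saved")
```

On the numbers above that works out to roughly 3 percent saved locally and a little over half on the server.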
Fine, What Else Can We Do?
Well, we at wercker clearly advocate building and testing your code in a Docker environment, just not using a tool primarily designed to copy files around to do it :D
Here are some steps you can take to work around the anti-pattern that is Dockerfiles:
1) Build your product in at least two phases, a build/test phase and a deploy phase. Sound familiar? This is a process we described thoroughly in a blog post a few weeks ago: Deploying minimal containers to Kubernetes. In your build/test phase you install your dependencies, build, test, generate an artifact to pass on to your deploy phase, and then throw away the build container. In the deploy phase, with a nice clean, minimal, container, put that artifact in the right spot and wrap it up with a bow.
2) If you can get away without using a registry your ops team will thank you. Don’t have an ops team? I promise you that there are a thousand ways of storing and moving a tarball around more reliably than using the registry. Our CLI even has a pull command to grab tarballs of any of your builds if you’d like one.
3) Do it in one layer. You’re going to want to be rebuilding your dependencies often enough that your savings really won’t add up.
4) Release it with a scratch container if you can. We love producing 2MB containers containing just a Go binary.
5) Use a tool designed for building, testing and deploying your product. Extra points if it knows how to launch the services your code depends on alongside it for the building and testing phases, like wercker ;)
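As a rough illustration of steps 1 through 4, here is a sketch of a two-phase pipeline that refuses to package anything when the build/test phase fails. It is not how wercker does it. The function names, the ./build-and-test.sh script, and the directory layout are all hypothetical; the only real commands assumed are docker run, tar, and docker import.

```python
import subprocess


def plan(source_dir, build_image, artifact_dir, tarball, release_image):
    """Return ordered (name, command) steps for a build/test and a deploy phase.

    source_dir should be an absolute path so docker treats it as a bind mount.
    ./build-and-test.sh is a hypothetical script in your repo that builds,
    tests, and writes the artifact into artifact_dir.
    """
    return [
        # Phase 1: build and test in a throwaway container (--rm deletes it).
        ("build-and-test", ["docker", "run", "--rm",
                            "-v", f"{source_dir}:/src", "-w", "/src",
                            build_image, "./build-and-test.sh"]),
        # Phase 2: wrap only the artifact in a tarball you can store anywhere...
        ("package", ["tar", "-cf", tarball, "-C", artifact_dir, "."]),
        # ...and import it as a flat, single-layer image with no base.
        ("import", ["docker", "import", tarball, release_image]),
    ]


def run_command(command):
    """Run one command and return its exit code."""
    return subprocess.run(command).returncode


def run_pipeline(steps, runner=run_command):
    """Run steps in order and stop at the first failure.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, command in steps:
        if runner(command) != 0:
            return completed, name
        completed.append(name)
    return completed, None
```

Calling run_pipeline(plan("/work/app", "golang:latest", "/work/app/dist", "/work/app.tar", "myapp:1.0")) would run the three steps in order, with made-up paths and names. The imported image has no command set, so you would pass one at run time, and the runner argument is only there so the flow can be exercised without Docker installed.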