Speed up Docker image builds with cache management

Nov 09, 2022 (3 min read)

Over the past few years I have worked on multiple IT projects where the Docker platform was used to develop, ship and run applications targeting various industries. I have also conducted many interviews, and over time I have noticed that many DevOps engineers do not pay enough attention to the details and essential elements of the platform.

Therefore, I decided to collect and summarize the information needed to create optimal Docker images.

I see five critical areas that are often overlooked or misused by developers. This is the first article in the series. The full list of topics that will be covered is below.

Proper use of cache to speed up and optimize builds.
Selecting the appropriate base image.
Understanding Docker multi-stage builds.
Understanding the context of the build.
Using administrator privileges.

I encourage you to start your journey toward writing optimal Dockerfiles now.

Speed up image builds with cache management

What interests us the most in creating Docker images is the ability to add custom files and execute commands, such as installing dependencies or compiling code. We can achieve these goals very quickly by creating a Dockerfile that resembles the steps we would perform in a console.

An example for Node.js, where npm is the dependency management tool:

FROM node:18.7.0-alpine
COPY . .
RUN npm ci

In the above example, we are using the Node.js 18 image in a lightweight version based on a Linux distribution called Alpine. We copy all the files from the current context and start installing dependencies using the npm ci command.

Simple, right? However, this is not the optimal approach from Docker's point of view. Docker has an internal mechanism that lets it reuse the layers of an image you built earlier. With the Dockerfile as presented, that mechanism is of little use: because all files are copied before the dependencies are installed, any change to the application code forces npm ci to run again.

The image build cache is not complicated to work with. When copying files into an image with an ADD or COPY instruction, Docker compares the contents of the files and their metadata with those it already has (using checksums). If nothing has changed, it reuses the previously prepared layers, including those for the RUN instructions that follow. When it encounters another ADD or COPY, it checks the files again, and so on until the end of the build process. Once one layer has to be rebuilt, every layer after it is rebuilt as well.

Our Dockerfile should therefore look as follows:

FROM node:18.7.0-alpine
COPY package.json package-lock.json ./
RUN npm ci
COPY . .

As before, we use the same Node.js base image. This time, however, instead of immediately copying all the files into the image, we first copy only the files that describe the required dependencies. We then install the dependencies and finally copy the application code.

Let's now follow the process of building this image, assuming that only the application code has changed and the dependencies remain the same. Docker encounters the first COPY instruction, checks the files it refers to and finds that nothing has changed, so it uses the previously prepared layer. In the next step, it compares the RUN instruction with the one recorded in the cache. It is unchanged, so Docker can reuse the existing layer here as well. In the last step, it copies the new application files into the image, since this is the only place where changes have been made.
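
To make this walk-through concrete, here is the same Dockerfile annotated with what we would expect on a rebuild in which only the application code changed. The comments describe the expected cache behavior under that assumption; they are not captured build output.

```dockerfile
# Base image: reused as long as the image behind this tag is already available locally.
FROM node:18.7.0-alpine

# The dependency manifests are unchanged, so this layer is taken from the cache.
COPY package.json package-lock.json ./

# Same command on top of a cached parent layer: cached, npm ci does not run again.
RUN npm ci

# Application files changed: the cache is invalidated here and this layer is rebuilt.
COPY . .
```

With BuildKit, the default builder in recent Docker releases, reused steps are typically marked CACHED in the build output, which is a quick way to confirm that the ordering works as intended.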

Anyone who has ever seen the size and number of files in the node_modules directory, where Node.js packages are installed, will appreciate how much time and I/O this saves. Node.js is no exception here: similar dependency management can now be found in many languages and environments.
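
The same ordering carries over to other ecosystems. As a minimal sketch, assume a Python project whose dependencies are listed in a requirements.txt file. The base image tag and the file name are illustrative assumptions, not part of the Node.js example.

```dockerfile
# Assumed base image; any Python image would illustrate the same idea.
FROM python:3.11-slim

# Copy only the dependency manifest first ...
COPY requirements.txt ./

# ... so this step is rebuilt only when requirements.txt changes.
RUN pip install --no-cache-dir -r requirements.txt

# Application code changes most often, so it is copied last.
COPY . .
```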

It is worth noting that installing dependencies is just one of many tasks that are performed less frequently than changes to the application code. Sometimes you also need to prepare a directory structure, permissions or user accounts. All of these operations should be declared in the Dockerfile as early as possible; following this rule is the easiest way to make proper use of Docker's caching mechanism.
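
As a sketch of this rule, the Dockerfile below puts rarely changing setup steps before anything that depends on the application code. The /app directory, the data subdirectory, and the app user and group names are hypothetical, chosen only for illustration.

```dockerfile
FROM node:18.7.0-alpine

# Rarely changing setup goes first: user account, directories, permissions.
# addgroup/adduser are the BusyBox variants shipped with Alpine.
RUN addgroup -S app && adduser -S -G app app
WORKDIR /app
RUN mkdir -p /app/data && chown app:app /app/data

# Dependencies change occasionally.
COPY package.json package-lock.json ./
RUN npm ci

# Application code changes most often, so it comes last.
COPY --chown=app:app . .

# Run the container as the unprivileged (hypothetical) app user.
USER app
```

With this ordering, a change to the application code leaves the user, directory and dependency layers cached, and only the final COPY layer is rebuilt.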

If you want to know more about Docker best practices, check out the next article in this series: Selecting the appropriate base image.