DockerCon
Machine Learning Inside the Container
Jinjing Zhou, CTO, TensorChord
Transcript
Today I’m going to talk about machine learning inside containers, and how we bridge the gap between development and production environments for machine learning practitioners.
I’m Jinjing. I’m from TensorChord, where I’m the CTO and co-founder. Previously, I worked as an engineer at AWS, mainly on deep learning on graphs. I’m a founding member and core developer of the Deep Graph Library (DGL) project, which is a framework that helps applied scientists do machine learning on graphs. Currently at TensorChord, we maintain three products: envd, openmodelz, and pgvecto.rs. We are an AI infrastructure company, and we love open source very much, so our major products are all open source and focused on AI infrastructure. You’re welcome to check our GitHub repositories; my GitHub account is @VOVAllen, and I’m glad to speak here.
I’m going to talk about how to containerize your machine learning development environment with our tool called envd. On the left side is our lovely logo. We incorporated the envd letters into the logo, and, since everyone loves cats, we made it a cat.
Table of Contents
Why containers?
The first question is: why containers? Why do we think containers are good for data scientists to use as their development environment? Here are the problems that bother data scientists a lot. The first one is that it is hard to set up environments for different projects. The core reason is that in the machine learning area, Python and C++ are heavily used, and C++ libraries always have chaotic dependency problems. It becomes much worse with CUDA: you need to match the GPU and CUDA version with the deep learning framework version, and that becomes a nightmare for data scientists. It’s hard for them to set up isolated environments for different projects.
The second problem is that data scientists know a lot about models and algorithms, but they know less about infrastructure. Based on our investigation, we found that only some data scientists have heard of Docker, and only a small number of them have used it. It’s very rare for data scientists to actually know how to write a good Dockerfile to build a proper Docker image.
We’ve even seen companies where the scientists just use docker commit. They do everything in the container and commit layer over layer. Eventually, it becomes a 20-gigabyte image, and nobody knows what’s in it. We think that’s pretty bad practice for everyone: the data scientist doesn’t know what’s actually in the environment, and the team doesn’t know whether the image is safe or whether there are security problems inside it.
The third problem is that it’s hard for data scientists to reproduce other people’s work. New algorithms come out every day, and a major part of a scientist’s work is spending time reproducing others’ results; reproducibility is a perennial problem in this area. We think containers are a good tool to improve this situation: if we make them easy to use and they become a standard, then everyone can reproduce other people’s environments and work easily.
Tools
Here’s a figure that shows the complexity of the Python ecosystem. Even with Python alone, on a typical development machine such as a macOS laptop, you have multiple Pythons: the system built-in one, the Homebrew one, and so on, and many times you don’t know which Python you actually need. Suppose you pip install something and it goes to the system built-in Python; then you can’t find it from the Python you actually run. We found this is quite common if you do everything directly on the host machine.
There are existing solutions for this, such as Anaconda, but they mainly focus on the Python ecosystem, which is only part of the problem. Here I listed some things not covered by the Python environment tools. You have development tools that sit outside the Python ecosystem, like editor settings, and also CUDA: there are complex CUDA dependency problems in deep learning frameworks such as PyTorch, where many builds are bound to a specific CUDA version. So we think Python-only tools are not enough. There are also many C++ libraries, such as OpenCV or the OpenMMLab projects like MMDetection, that are not part of the Python ecosystem. We think containers are a better tool for machine learning scientists to manage their projects and development environments: you can have a separate image for each of your projects.
envd
How do we help machine learning scientists containerize their development environments? We wrote a CLI tool called envd that helps users create container-based environments for machine learning. The syntax looks just like Python. It’s actually a dialect of Python called Starlark, which was originally created at Google and is the language used by the Bazel build system. We use it so users can express their environment requirements in a Python-like syntax. They are already familiar with Python, so it’s easier for them to get on board with the tool. You just define a build function and describe what your environment needs, such as the base operating system image, the CUDA version, and the packages you need. With that file, you just run “envd up” and that’s all: you get a container-based development environment like this. You can see it’s already the latest version, and it will say okay, you are in the envd environment.
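As a rough illustration, a minimal build.envd might look like the following sketch (based on envd’s documented syntax; the package names are placeholders):

```python
def build():
    # Base image: operating system plus language runtime
    base(os="ubuntu20.04", language="python3")
    # Python packages to install into the environment
    install.python_packages(name=["numpy", "torch"])
    # Use zsh as the shell inside the container
    shell("zsh")
```

Running “envd up” in the folder containing this file builds the image and drops you into that environment.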
The other wonderful feature is that you can predefine recipes for commonly used tools. Here we use TensorBoard as an example. envdlib is our main recipe repository; we provide recipes for many common tools, so you can directly call a function in envdlib to get a tool set up. For TensorBoard, for instance, you need it installed and running inside the container. Without envd, you would need to change the entry point of the image, and that’s only one part: you might have multiple processes that need to run in the container simultaneously. It becomes complex if you’re writing a plain Dockerfile.
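For instance, setting up TensorBoard through envdlib might look like this sketch (following the include pattern from envd’s documentation; the port number is arbitrary):

```python
# Import the community recipe repository
envdlib = include("https://github.com/tensorchord/envdlib")

def build():
    base(os="ubuntu20.04", language="python3")
    # One call installs TensorBoard and runs it as the extra
    # process alongside your shell inside the container
    envdlib.tensorboard(host_port=8888)
```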
The third feature is also enabled by the Python syntax. A common situation is that the production environment is slightly different from the development environment: you want to include as many dev tools as possible during development, but you want to strip them out and minimize your production image as much as possible. Here, the upper function defines the core dependencies, including the Python packages that are shared by both dev and production. Then in the dev image you can declare the extra tools you need, and in the serving image you can just leave them out, define something different, or define the entry point, so the serving image can run out of the box. A sketch of this pattern follows below.
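The function names and the runtime calls here are illustrative assumptions, not a verbatim envd API reference:

```python
def basic():
    # Core dependencies shared by dev and serving
    base(os="ubuntu20.04", language="python3")
    install.python_packages(name=["torch"])

def build():
    # Development image: core deps plus dev-only tools
    basic()
    install.python_packages(name=["ipdb", "black"])
    config.jupyter()

def serving():
    # Production image: core deps only, plus an entry point
    basic()
    runtime.expose(envd_port=8000, host_port=8000)
    runtime.daemon(commands=[["python", "server.py"]])
```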
Demo time
Next, I will do a bit of a demo. Here’s an empty folder with a build.envd file; .envd is our custom file extension. I’ll write a simple build function here. It defines the base, defines a Python package (we want just a placeholder package for the demo), and sets up the Jupyter notebook. Then I just call envd. I think I already have an environment from before, so I first need to remove it, and then I just do an “envd up”. It starts the build process and installs everything. Because I’ve run the build before, you can see everything is cached. The user gets the environment out of the box. Here, I’m already in the envd environment inside the container.
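Reconstructed from the description above, the demo file is roughly the following (the package name is a placeholder):

```python
def build():
    base(os="ubuntu20.04", language="python3")
    install.python_packages(name=["numpy"])  # placeholder package
    config.jupyter()
```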
I can see the container here. Here’s the python-basic container, and it exposes the ports we defined. We’ve also embedded an sshd server inside the container, so you can easily use the environment on a remote machine, which is a common scenario for machine learning practitioners: you usually develop a little bit on your own machine and then move to a powerful cluster with more GPU power. So we think remote development over SSH is needed. We add an entry to your SSH config file, so you can just do “ssh python-basic.envd” to get into the environment.
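For reference, the entry envd adds is conceptually similar to a standard OpenSSH config block like this one (all values below are hypothetical; envd manages the real host alias, port, and key path for you):

```
# Hypothetical entry; envd generates the real values
Host python-basic.envd
    HostName 127.0.0.1
    Port 2222
    User envd
    IdentityFile ~/.config/envd/id_rsa_envd
```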
Why not Dockerfile?
Why not Dockerfile? Why do we need to define our own build-file format? The first problem we found is that it’s hard to reuse parts of a Dockerfile. Say you have a library that’s hard to install, such as MMDetection or some custom operator that isn’t distributed in a well-packaged format. You write the install steps once, but when you need to reuse them, you can only copy and paste from one Dockerfile to another. Eventually it becomes a mess when those parts need to be updated.
The second problem is that we think the domain knowledge Dockerfiles require is not a good fit for data scientists. As I mentioned, data scientists are good at algorithms and modeling, but they usually don’t have much experience with Dockerfiles, and it’s hard for them to write a good one. I would say it’s hard even for other engineers to write a really good Dockerfile. There are so many things to take into consideration: how do you properly design the layers, and how do you reuse the cache? Say you have two images and you want to install PyTorch in both; how can you reuse the pip cache between them? When you build the image, how do you use the files on the host? How do you run Jupyter notebooks in the container? And how do you run Jupyter and TensorBoard simultaneously in the same container? These are all real problems if you want to use a container as your development environment.
We feel it’s hard for data scientists to use such tools, so we think our Starlark-based syntax is simpler and much more familiar to data scientists.
BuildKit
Here are the internals of envd. We rely heavily on BuildKit. BuildKit is the next-generation build engine for Docker. It’s a low-level build library for constructing Docker images, and it’s Dockerfile-agnostic: modern Dockerfile builds run on top of BuildKit, but you don’t have to use a Dockerfile. Any developer can define their own frontend language on top of it.
Another fantastic feature of BuildKit is that it supports parallel builds. In a Dockerfile, everything runs linearly: one step must happen after the previous step finishes. With BuildKit, you can parallelize independent steps and merge their results later. Because it overlaps network I/O, it can greatly accelerate builds whose steps can overlap with each other, which helps improve build speed.
The third part is cache efficiency. BuildKit has a specially designed cache system, so you can easily share caches between different builds. For example, if you have several projects using PyTorch, you can share the pip cache among those builds. The catch is that BuildKit is very low level, so you need to define your own interface for users.
Here’s the architecture of envd on top of BuildKit. You have your project folder with an extra build.envd file, which describes your environment from development to production. Inside envd, we compile the Starlark, converting the envd file into our internal intermediate representation. Then we convert that into BuildKit LLB; LLB stands for low-level build, and it’s the representation BuildKit uses internally to construct images. When we finish the LLB construction, we send it to the BuildKit daemon, which handles everything about the build and produces an image. Finally, it sends the image to the Docker daemon, so you can see it in Docker like any other image.
Here’s a benchmark comparing envd to Dockerfile. On the first build, we are about twice as fast as Dockerfile. This is achieved by parallelizing the Python install step and the apt install step, and we have other rules to parallelize the user’s build script as well. We also have specialized optimizations for rebuilds. It’s very common for developers to change their development environment, for example to add a few more packages and want them immediately. We optimize for such scenarios, so envd becomes roughly six times faster than Dockerfile in rebuild scenarios.
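For instance, in a build file like the following sketch, the apt step and the pip step have no dependency on each other, so envd can schedule them in parallel in the BuildKit graph (the package names are arbitrary):

```python
def build():
    base(os="ubuntu20.04", language="python3")
    # Independent steps: BuildKit can execute these in parallel
    install.apt_packages(name=["htop", "ffmpeg"])
    install.python_packages(name=["torch", "numpy"])
```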
Remote build
The other thing we support is remote builds: users can write the build.envd file locally and run the build on a remote machine. We have a concept of context in envd. You just create a new context, say one called remote-build, specify the builder as a TCP address, and that’s all. It’s also powered by BuildKit.
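As a rough sketch, creating such a context looks something like `envd context create --name remote-build --builder tcp --builder-address <builder-host>:8888 --use`; treat the exact flags as an assumption and check `envd context create --help` for the authoritative form.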
This is pretty useful, for example, when your build process needs lots of CPU resources, which is common in machine learning workflows where you need to compile something from C++ source. You need lots of CPU, and the libraries in machine learning are pretty large (TensorFlow and PyTorch are gigabyte-level), so you also need good network bandwidth to make the build fast. With remote builds, you can put the build machine in the public cloud. Everything works the same but much faster, because a public cloud machine has larger network bandwidth and you can give it more CPU cores. It also improves the development experience: you don’t have to run everything on your macOS laptop, which feels bad when your machine works super hard and spends a very long time building the image. It’s kind of a waste of time.
We found that remote builds are a great feature, especially for team CI/CD workflows. You can offload the build process to a centralized machine, which also makes cache management much easier: you can reuse the cache as much as possible across your cluster.
Here’s a figure that shows a CI/CD setup. You create a context, so you can offload the build from a GitHub Actions runner, which probably has only one or two cores, to your dedicated build cluster, and finally push the image to the remote registry. Everything just works out of the box. We feel this can greatly reduce CI/CD time, which also saves developer time.
Conclusion
That’s all for my talk today, and thank you so much for being here. Thanks also to my colleague Keming, who helped a lot with these slides and with envd.
You are welcome to star us on GitHub at TensorChord. Again, we have three projects now: envd, which I just introduced; openmodelz, a framework that helps developers deploy models on a cluster (it handles the infrastructure engineering, so you only need to add machines to the cluster and ask it to deploy models); and pgvecto.rs, a vector search extension for Postgres, so you can do vector search directly inside Postgres. That’s all. Thank you so much.
Learn more
- Artificial Intelligence and Machine Learning With Docker
- Docker BuildKit
- Get the latest release of Docker Desktop.
- Have questions? The Docker community is here to help.
- New to Docker? Get started.
- Subscribe to the Docker Newsletter.