DockerCon

Docker Offshore: Smoothing the Waves of Remote-Location MLOps

Lewis Hein, Programming Analyst, Western EcoSystems Technology

Recorded on November 9th, 2023
This presentation describes a pilot system for monitoring wildlife activity in offshore wind farms, where physical access is severely limited and internet connectivity may not exist at all. This study shows how containerization technologies such as Docker can help MLOps be successful in these regions.

Transcript

Welcome. Today, I am going to tell you about taking Docker into the wild west, or maybe just the wild ocean, and how to do MLOps in places you never imagined. So as most of you know at this point, I am with Western EcoSystems Technology. That’s where I work; we are an environmental consulting company, and we focus on helping wind energy companies navigate the regulatory landscape, the compliance landscape, and things of that nature.

This project I’m going to be showcasing was funded by a grant from the U.S. Department of Energy. However, none of what I’m going to say is in any way an official opinion endorsed by the Department of Energy. Also, this was a team effort, and the other people on the team made it a success alongside the rest of us. I’m very grateful to them.

Table of Contents

    Overview

    Before I dive in, I’m going to give a bit of a roadmap of where we’re going. First, an overview of wind energy and why we care. And why doing the environmental monitoring we do is difficult off-shore. Then I’m going to discuss how edge ML can provide a solution to that and how that solution in turn involves new problems. Then I will mention a little bit about how Docker helped us address these. Finally, I’ll discuss a pilot study we did for technology validation for all of that and share a few results.

    Before I dive deep into the technical weeds of all the software we built and where we put it, I’d like to start with some motivation. Why would we even do something like this? Basically, energy comes from somewhere. All the lights in this room are run by electricity. The scooter I rode this morning was electric. Increasingly, we need electricity to provide a standard of living. And that electricity has to come from somewhere. Sometimes it comes from a coal mine, for example, this one in my native state of Wyoming. And, sometimes it comes from a wind farm.

    Wind energy

    In general, wind energy has emerged as an energy source with more upsides and fewer downsides than many of our other energy sources. That said, wind energy is not perfect, and we need ways to manage these imperfections. So, it needs to be carefully monitored to mitigate danger to flying animals. Birds are protected by federal law, and bats, at least some species of them, are protected under the Endangered Species Act and also by some state laws.

    We have a pretty standardized protocol for doing this monitoring on land-based wind farms. To demonstrate that, I’m going to ask you to imagine that there was a wind turbine in the center of the stage here with spinning blades. So imagine that wind turbine there. And along comes a bird that isn’t looking where he’s going, flying along, and the spinning blade comes and hits him. And, sadly, he is no longer flying. That is an event that we need to record. We need to monitor these types of events, because they can inform how wind farms are managed. But, just because it happens doesn’t mean we automatically know that it happens. So the next step is to find a human. This was actually my first job out of college. You draw a square around the wind turbine. The human walks, looks at the ground, turns, walks, and is like, oh, here’s a bird. And now you do a bunch of paperwork to record that.

    Off-shore challenges and advantages

    Now imagine trying to do that off-shore. I don’t know how many of you have tried walking on water lately. Last time I tried, it didn’t work very well. But, there are a lot of advantages to off-shore wind power. Specifically, off-shore wind turbines can be larger. I live in a state that has a lot of wind energy, and every single day — well, maybe not every single day, but most days — I see these massive turbine blades being squeezed through our infrastructure.

    When you are working off-shore, you can basically build a blade as big as you want, put it on a boat, and take it wherever you need. Also, as wind turbines get larger, they are more efficient. And finally, off-shore wind is a more reliable source of energy, which helps contribute to grid stability. But, like I just demonstrated, wildlife collision monitoring in the traditional sense is based on land, and it makes a lot of assumptions: that if something falls, it will stay on the ground; that a person can walk in the place where the turbine is.

    We need a different solution to make this work in an off-shore environment. And, I should say, monitoring off-shore is still important because there are still animals using the air space in the off-shore environment. What we do is we use computer vision. Instead of walking around like I demonstrated and finding the results of an animal blade collision, we need to detect that event in the moment that it happens or at least to get a video recording of it. Computer vision is one of these technologies that has sort of transformed the world without a lot of big press. But it can be very good these days at automatically detecting objects in an image or in frames of a video.

    This means that detecting animal blade collisions can be done in two easy steps. Step one, we record some video of the turbine. Step two, analyze the video. Thank you, any questions? As you may have guessed, this is much more difficult than it sounds, because of a few constraints.

    Constraints

    First of all, the Atlantic Ocean is not famous for internet connectivity. And even if you have internet connectivity, there are many regulatory agencies who are very anxious to tell you, no, you cannot connect to that. It’s involved with energy infrastructure. This means that we may have no connectivity.

    Not having connectivity means that we need to process all of our data in the same place that we collect all of our data, which we call the edge. This sort of collides with another feature of the system, which is that we need many cameras per turbine to get good coverage of those blades as they move and as the turbine itself turns. All in all, this can come to a few hundred gigabytes of data per day, and again, we have no connectivity to send that data anywhere else. If you were to try the naive thing and simply store all that data on hard drives in the turbine, you would run into another problem: you can’t necessarily get there for months at a time, and storing a few months of data at that rate would take a ridiculously large pile of hard drives. So we don’t do that. Instead, we have what you can think of as a compression algorithm. We use the computer vision system in real time to examine the data feeds, and like all compression algorithms (at least all lossy compression algorithms), it makes a call about what data is relevant and important and what data is dispensable. But doing all of this does not come free.
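To make the lossy-compression idea concrete, here is a minimal Python sketch of the kind of per-frame keep-or-discard decision such a system makes. The `Detection` type, the threshold value, and the labels are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str
    confidence: float

def frame_is_relevant(detections: List[Detection], threshold: float = 0.5) -> bool:
    """Keep a frame if any detection of a flying animal exceeds the threshold."""
    return any(d.confidence >= threshold for d in detections)

# Simulated per-frame detector output for a short clip
frames = [
    [],                        # empty sky
    [Detection("bird", 0.2)],  # low-confidence blip, likely noise
    [Detection("bird", 0.91)], # clear detection: keep
    [Detection("bat", 0.77)],  # keep
    [],
]

kept = [i for i, dets in enumerate(frames) if frame_is_relevant(dets)]
print(kept)  # indices of frames worth storing: [2, 3]
```

In a real pipeline the decision would also consider temporal context (keeping a window of frames around each detection), but the essential trade-off is the same: the detector decides what survives.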

    There are also some things our hardware needs to do for us. It needs to work reliably in the edge environment. More importantly (and this is a hardware constraint, but also a whole-system constraint), it needs to work continuously with no supervision. If you’re going to send it out into the wide world for eight months or so, you really want it to still be working correctly when you come back. Also, this needs to run in real time because of the lossy-compression nature of what we’re doing. And, finally, it needs to be fairly easy to set up. You do not want to send someone out with a thick instruction booklet and say, do this, because that is error prone.

    Solutions

    There are solutions for all of these. To make it run continuously and autonomously, we decided we would just have one edge device and try to harden it and make it as reliable as possible. We had a lot of infrastructure around this computer vision model for things like data acquisition, and data processing, and data storage.

    We built this into a microservices-like architecture. It’s not the typical microservices you might think of running in AWS or Google Cloud or something. They were all hosted on our edge machine. Each microservice was in a separate Docker container. You could have one for video acquisition, for performing machine learning inference, etc. This allowed us to develop all of these microservices with different teams who were experts in those specific domains.

    Our machine learning team could develop the service that did inference. And our development team could develop the other services, for things like data management and interacting with the cameras for data acquisition.

    Using Docker

    This was one of our first hints that Docker was going to be really good for this project. We realized we could just agree on our interface contracts at the microservice boundaries, make sure that other teams knew what we expected of them and what we were providing, and then work in our own little containerized world and do our own job well. This also allowed us to test our own services and have some confidence that those tests would be meaningful for reliability in the production environment. And because we could host our images on Docker Hub, installing on new hardware or test hardware became as simple as running a docker pull.
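An interface contract at a microservice boundary can be as simple as an agreed message schema that both sides validate. Here is a hypothetical sketch of what a contract between a video-acquisition service and an inference service might look like; the field names are illustrative assumptions, not the team's actual schema.

```python
import json

# Hypothetical contract for messages the acquisition service publishes
# and the inference service consumes (field names are illustrative).
FRAME_MESSAGE_FIELDS = {"camera_id": str, "timestamp": float, "frame_path": str}

def validate_frame_message(raw: str) -> dict:
    """Parse a JSON message and reject anything that violates the contract."""
    msg = json.loads(raw)
    for field, expected_type in FRAME_MESSAGE_FIELDS.items():
        if field not in msg or not isinstance(msg[field], expected_type):
            raise ValueError(f"contract violation on field: {field}")
    return msg

msg = validate_frame_message(
    '{"camera_id": "cam-3", "timestamp": 1699538400.0, '
    '"frame_path": "/data/cam3/000123.jpg"}'
)
print(msg["camera_id"])  # cam-3
```

With a check like this on both sides of the boundary, each team can test its own container in isolation and still trust that the pieces will fit together in production.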

    We took all of these services, and we coordinated them in a Docker Compose file that could be pre-installed on the machine, so the actual hardware installers didn’t have to think about it. And so the installation process, once we had our containers built and hosted, was as easy as pulling the container images, getting test data, and doing a quick run of the Docker Compose file to validate the installation.
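As a rough sketch, a Compose file for a stack like the one described might look something like the following. The service names, image names, and GPU reservation are assumptions for illustration, not the project's actual configuration.

```yaml
services:
  redis:
    image: redis:7
    restart: unless-stopped

  video-acquisition:
    image: example/video-acquisition:latest   # hypothetical image name
    restart: unless-stopped
    depends_on:
      - redis

  inference:
    image: example/inference:latest           # hypothetical image name
    restart: unless-stopped
    depends_on:
      - redis
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

With a file like this pre-installed, bringing the whole stack up is a single `docker compose up -d`, and `restart: unless-stopped` keeps each service running unattended.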

    Also, like I said earlier, we needed to keep up with data in real time, and not just keep up with data: we needed to keep up with a firehose of data. Docker helped us do this in a few important ways. First of all, we could use some well-established, battle-tested solutions such as Redis, such as Triton, such as the Nvidia stack. Doing this without Docker is definitely possible, but it’s also very painful, especially when the dependencies of one thing begin to conflict with the dependencies of another. And we realized that having the Dockerized versions was invaluable to us, because we could just type docker pull, Triton, and the version number. And we had a working Triton server.
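For example, pulling and serving with a prebuilt Triton Inference Server image looks roughly like this; the version tag and model path here are just examples, not the ones the team used.

```shell
# Pull a prebuilt Triton Inference Server image from Nvidia's registry
docker pull nvcr.io/nvidia/tritonserver:23.10-py3

# Serve models from a local model repository, with GPU access
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /opt/models:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models
```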

    I would also point out, calling back to all the speed improvements that were discussed on day one, Docker is fast. It is incredible to me that you get almost VM-level isolation with very, very low overhead. Additionally, the nature of these sorts of projects is that, especially at testing time, you are adding sensors, subtracting sensors, and changing your workload. And having this microservices architecture enabled us to scale on demand by just adding Docker containers or turning off Docker containers we didn’t use.

    Deployment

    So, deployment: this is where it begins to get especially tricky. A lot of these off-shore environments are highly specialized environments; you need a lot of training to even be there. You need training to be on a boat or a helicopter. You need training to interact with the turbine safely. Getting this training for a member of our team could take many months of their time.

    But there are already people who are amply qualified to go into that environment: the people who work on wind turbines. Now, these people do not necessarily understand our system, nor should they be expected to. That meant we could take our computer, pre-image it with all of the files and the Docker containers and images we needed, and hand it to these people.

    Then, as sort of a final step, we set our edge device to boot on power, such that it would boot up as soon as it was plugged in. However, simply booting up is not enough. There are NAS systems, there are cameras, there are network switches. So we registered our Docker Compose file as a systemd service and used systemd to start the Docker Compose stack at a specified time. But, additionally, we wanted to do health checks to make sure that no service started before its associated piece of hardware was ready.
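A systemd unit for this might look roughly like the following; the paths and unit name are assumptions for the sketch, not the project's actual files.

```ini
# /etc/systemd/system/monitoring.service (hypothetical path and name)
[Unit]
Description=Wildlife-monitoring Docker Compose stack
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/monitoring
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target
```

To start the stack at a specified time rather than immediately at boot, a unit like this can be paired with a systemd timer instead of being enabled directly.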

    Again, Docker was incredibly powerful for this, because of the health checks available in Docker Compose. In some cases, we could build a Docker container whose exclusive job was to check on a dedicated piece of hardware and report healthy when that hardware became ready.
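In Compose terms, that pattern looks roughly like this; the image, mount path, and check command are assumptions for illustration.

```yaml
services:
  nas-check:
    image: alpine:3.18
    command: ["sleep", "infinity"]
    volumes:
      - /mnt/nas:/mnt/nas:ro      # hypothetical NAS mount point
    healthcheck:
      # Report healthy only once the NAS directory is actually visible
      test: ["CMD", "test", "-d", "/mnt/nas/recordings"]
      interval: 10s
      retries: 30

  video-acquisition:
    image: example/video-acquisition:latest   # hypothetical image name
    depends_on:
      nas-check:
        condition: service_healthy
```

Here `video-acquisition` will not start until the `nas-check` container's health check passes, which is exactly the "don't start before the hardware is ready" guarantee described above.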

    Pilot study

    This is all nice in theory, but I want to discuss some results. To get said results, we did a pilot study. This pilot study was land based, partly because of what we had available and partly because in a pilot study it is very useful if you can, in a sense, cheat: if you can walk up to your hardware and troubleshoot it, if you can SSH in even when the connection isn’t good. But we also tried in this pilot study to be very cognizant of how this system would behave if we could not log in or babysit it.

    We developed a prototype system, which I have already discussed, and we tested it on a land-based turbine owned by the University of Minnesota. Very grateful to them for letting us install this in their research turbine. And I will point out that even though this was land based, the connectivity was not good. Another thing we discovered is that despite the fact that wind farms are energy-generation facilities, the power supply to a piece of equipment in a wind farm is not always as consistent as, like, the power supply to your home or whatever. So that made it all the more essential that our system needed to autonomously try to start working whenever power was connected to it.

    Surprises

    Another big win for Docker on this project came about a week after our initial deployment, when we discovered that some of our cameras, which were supposed to be saving data on a schedule, were actually not saving data at all. For reasons. What those reasons were, we never found out, because it was easy to take a Dockerfile and spin up some RTSP streaming, which is a really standard protocol used for security cameras. And post-deployment, it was easy to package up this new microservice, send it off, and integrate it with our system.
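As a sketch of what such a drop-in service can look like, here is a hypothetical Compose service that pulls a camera's RTSP stream with ffmpeg and saves it in timestamped segments. The image, camera address, and storage path are all assumptions.

```yaml
services:
  rtsp-recorder:
    image: linuxserver/ffmpeg:latest     # any image providing ffmpeg works
    restart: unless-stopped
    volumes:
      - /mnt/nas/cam1:/recordings        # hypothetical storage path
    # Copy the stream without re-encoding, cut into 5-minute segments
    command: >
      -rtsp_transport tcp
      -i rtsp://192.168.1.101/stream1
      -c copy
      -f segment -segment_time 300 -strftime 1
      /recordings/cam1_%Y-%m-%d_%H-%M-%S.mp4
```

Because the service is just another container, integrating it after deployment is a matter of adding it to the Compose file and pulling the image.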

    Another time that all of this theory was pretty strongly tested came a few weeks into deployment, when we discovered that our system was behaving in a slightly unexpected way. I don’t know what you think a typical power consumption graph over time looks like, but it’s not this, and our system was doing this. After bootup, at idle, it would draw a very small amount of power, as expected. As our processing pipeline started to work, power consumption would increase, also as expected. But then something very strange would happen about a minute later.

    We discovered, as we were frantically debugging the system and running some statistics on the output of the APC unit, that the power consumption of our edge device would randomly begin to skyrocket, to the point that suddenly the device would crash. And that was probably a good thing, because it was probably a safety system tripping on excessive current. Unfortunately, this happened, like I said, in the middle of a major deployment, and we were sort of stressed out about what we were going to do if our device was broken. As we said at the time, the joke was that IoT was the Internet of Toast.

    Spoiler alert: it was much easier than we thought it was going to be, thanks again to Docker. Because we put so much effort into a containerized workflow and an easy deployment, it was actually pretty easy to set up our new edge device. So, we got some hardware, and we assembled it. We installed some GPU drivers. That was probably the hardest part, honestly, and it wasn’t even that hard. We copied the systemd and Docker Compose files, and then we could just type “docker compose up” over the SSH connection. And, while this was certainly unwelcome in the middle of a deployment, I really would like to highlight that this is about as good as a surprise redeployment can go. And that is very much thanks to the fact that all our applications were containerized, and those containers contained all the libraries we needed, all the dependencies, without us even having to think about it. Well, we had to think about it, but we didn’t have to manually get all of those ducks in a row. They were just automatically in a row, thanks to Docker.
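The redeployment steps described above can be sketched as shell commands like these; the host name and paths are hypothetical.

```shell
# Copy the pre-built Compose file and systemd unit to the fresh edge device
scp compose.yaml monitoring.service edge-box:/opt/monitoring/

# Register the stack as a systemd service so it starts whenever power comes on
ssh edge-box 'sudo systemctl enable /opt/monitoring/monitoring.service'

# Pull the prebuilt images and bring the whole stack up
ssh edge-box 'cd /opt/monitoring && docker compose pull && docker compose up -d'
```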

    Results

    So, some results of the pilot study. After the little hitch of adding a device to the Internet of Toast, our system ran unattended for weeks at a time. I believe it ran for six weeks after that, until the pilot study ended, and all we did during that time was look at some system logs to confirm the system was up. If it had been in an off-shore turbine with no connectivity, it would have done its job flawlessly during that time. This study analyzed over 6,000 hours of video in real time, sorting relevant video from non-relevant video and achieving about a factor-of-10 data compression rate. And our algorithms did detect several collisions. Now, with all data compression, there comes the question of how you know that the lossy compression algorithm only threw away things you didn’t need. And that’s a very legitimate question, and one you need to ask.

    Therefore, alongside this technological pilot study, we did traditional monitoring, where we sent out a person to walk around and look at the ground, like I demonstrated earlier. And the results from that monitoring were statistically comparable to ours. Not only that, but we checked, on an individual basis, every event that our system detected and that our monitoring technician detected. And we had basically a one-to-one correspondence: everything the traditional method detected, we also detected, except maybe one, and we may have found one event with our system that the technician did not.

    So, it shows a lot of promise, not only for working in the off-shore environment, but for producing solid, repeatable, scientifically defensible results in the off-shore environment. We also learned some important lessons, both about these off-shore deployments and about any time you take a machine learning model and send it off into the wider world. Machine learning models, in order to make a difference and have a positive impact in the world, need to interact with the world. And that interaction needs to happen in places that may not be convenient for the developer; they may not have connectivity.

    I know we often use the metaphor of Docker as a shipping container for software. It’s even in the logo. And I think that metaphor is very applicable here as well, but with a slightly different spin. In the world of cloud computing, the value of Docker’s shipping container is that AWS, to an extent, does not care what’s inside your container. They don’t have to know all the details; they just have a standardized exterior to get hold of and work with. But when you’re taking these machine learning models to the edge, it’s slightly different. The shipping container metaphor still holds, but it’s as if your ML model is a meal you will be preparing in your kitchen. And you take your kitchen, with all your ingredients, and the stove you’re used to cooking on, and the oven whose idiosyncrasies you know, and you put all of that into the shipping container and take it to wherever you want to cook the meal.

    Lessons learned

    We also learned a few important lessons along the way. First, containerizing almost everything from the start is very, very powerful and very, very helpful. And it comes at almost no cost. There is a cost to getting your team up and running with Docker and training people to use it. But once you have paid that up-front investment, it is really no more difficult or slower to develop in a container. And when you develop and run in a container, you unlock a wide world of places that your machine learning model and your MLOps infrastructure can go.

    Containers are undoubtedly useful in the cloud, but even having helped with this project, looking back, it was still very surprising to me just how powerful containers at the edge can be. As we demonstrated, containerization helps deployments and redeployments go smoothly and scale with computational load. And this is true whether you have one computer in one turbine or 200 computers in 200 turbines. I really think it’s probably true even if you want to run a machine learning algorithm on an IoT device or a refrigerator somewhere. A lot of the constraints of edge ML are very similar across edge ML deployments, or even just edge software deployments.

    Also, something I cannot highlight enough: the reproducibility of guaranteeing that your dev environment matches your production environment gives you a lot of peace of mind when you send your complicated application off to an edge device, a refrigerator, or an IoT device.

    Conclusions

    And finally, a couple of things I’d like to point out about this video. You can see the model drawing boxes around flying animals. But at the end, right there, you’ll notice a fast-moving animal that sort of cruises in. This was a pretty common pattern: our model would pick up on things that we ourselves did not immediately see. And this, for me, suggests that with the power of containerization, we can take these established ML technologies to new places, solve new problems, and bring a level of quality, deployment speed, and accuracy that was previously unachievable, thanks to the power of combining ML with Docker. Thank you for listening.

    Q&A

    Any questions? The question was how we filter out insects. Filtering out insects is a surprisingly difficult problem, because insects are everywhere. At the resolution of the camera that we had, a moving insect close to the camera could really look like a moving bird far away from the camera. There are a few things you can do, but one of the most powerful is to use a time-series model on the movement of the insect itself. Another thing you can do is pay attention only to things that you know are not insects: they’re big enough, or they move in a certain way, or they have a certain shape.

    The question was, did we have a problem with hardware failures. Like I discussed, we had one very dramatic hardware failure. We also had other small hardware failures that came from the fact that this was a prototype system built with components that we had on hand, things like Ethernet cables that didn’t quite connect or got slightly corroded. To anyone who’s trying to do anything in the field, I would say find someone who’s an expert in hardware (whether that is somebody you become, somebody you hire, or a colleague of yours) and ask them to find every possible place your system will break. Then, try to reinforce those places with proven solutions; there are connectors, for example, that are proven to be weather resistant. But don’t just assume that because it all fits together and claims it will work, it will actually work under stress.

    The question was, since we had six cameras, were we sampling every frame from every camera. The answer is yes, with an asterisk. Two of the cameras ran at one frame per second, and that was just the frame rate we set them to. The remaining cameras ran at full frame rate because, even if it doesn’t necessarily look like it from far away, these turbine blades move pretty fast, so you need good temporal resolution. And, in case you’re wondering, yes, most of our computational load was just processing those video streams.

    Any other questions? Yes? Well, next we are looking for technology partners, like energy companies, who want to partner with us and deploy this on an actual offshore wind turbine. Now that the technology is validated, we want to take it and do, in some sense, a combination of a bigger validation and a test deployment. I’m sure there will be lessons learned when we do that. We are hopeful, and we’re currently searching for partners to put it on their wind farm and give them some data.

    If that is all the questions, thank you very much for your attention.

    Learn more

    This article contains the YouTube transcript of a presentation from DockerCon 2023. “Docker Offshore: Smoothing the Waves of Remote-Location MLOps” was presented by Lewis Hein, Programming Analyst, Western EcoSystems Technology.
