DockerCon

Cut Through Vulnerability Noise with Runtime Insights

Christian Dupuis, Senior Principal Software Engineer, Docker

Alex Lawrence, Principal Security Architect, Sysdig

Recorded on November 4th, 2023
Developers working with pre-release scanning tools find themselves drowning in a deluge of scanner output. Finding vulnerabilities is never a problem — determining if a risk must be addressed is the real challenge. Learn how Docker Scout and Sysdig can help.

Transcript

Good afternoon. Christian Dupuis here today, introducing you to Alex Lawrence from Sysdig. We’re going to be talking about the integration that we’ve built with Sysdig runtime insights. Just as a bit of context setting, what is this about? You saw in the keynote that we are on a mission to grab a lot of data and bring that data into the developer context in the inner loop. That’s really what we are trying to achieve with Docker Scout. The goal is to greatly reduce the cognitive overload, the noise, that a developer has to deal with when working locally to continuously improve the security of their supply chain. And one of the great partnerships we launched is with Sysdig.

And now we want to dive a little bit deeper than what we showed in the keynote. So with that, let me hand it over to Alex to show you Sysdig. I think Sysdig is offering a free trial for everyone coming from Scout as a Scout user so you can try this out on your own infrastructure. Alex, take it away.

Thank you. Basically, my goal today was to not bore you with a bunch of slides and talk about random stuff. I was just going to show you things inside of the Sysdig interface and have it really be focused on the stuff that’s live, right? What’s happening right now? Then, we’ll hand it back over to Docker at the end to show the integration into their side of the house.

    Sysdig interface

    This is the Sysdig interface. It’s all about what’s going on across your environment, across that whole CNAPP portfolio of things. This is a view of cloud activity, right? So this would be live things happening against cloud configuration. And I can look at this in a couple of different ways. Right now, I’m looking at it by account and region, resource type, things like that. But maybe I suspect Mateo’s account, or somebody’s account, is not acting correctly. And so it lets me do a live search. I can type something like Mateo up here, and it’ll show me just Mateo’s stuff. Then I can say I only care about Mateo, maybe delete actions. So if I just start typing delete, it’ll filter to any Mateo delete, right? I can basically look across all of the events that I have happening in the cloud and look for very specific things, right? I can see every time it’s deleted something in this context. I could remove Mateo or delete from the filter, get back to all of my events, and say, well, just look for any delete action and start sifting through all of my stuff live, right?

    This gives me kind of that perspective of what’s going on right now. And what should I care about at this point in time? I can take the data, and I can look at it in a number of different paradigms. So that was with cloud activity. I could look at it specifically from the perspective of what the users are doing across all of my clouds, be that AWS, Google, Azure, whatever it might be, and see it from a person context. And I can do that exact same searching where I can look at anything I want. Any string I put up here will run a quick search.

    If I typed cluster, I could find cluster deletes, or remove the delete filter and look for any action that was taken against a cluster specifically. It lets me look through all that stuff in a kind of novel and fast way. I could also reorganize this to move away from users or cloud stuff, and I could look specifically at stuff happening in Kubernetes or containers. I can narrow it down to just pods or workloads.

    What’s happening?

    Basically, it lets me explore visually what’s happening in my environment right now and what stuff I should really care about at this point in time. This one’s kind of fun because it’s also starting to show us things that are being rejected from a particular cluster. So I can see up here I’m getting events showing a container was rejected because it failed a scanning evaluation, right? So if someone builds a container that doesn’t pass the appropriate checks, for whatever reason, we actually get events on that failure, and I can even see the level of detail where things aren’t being admitted that are trying to be pushed into production. So it’s really just about all of that live data and what it’s actually doing.

    All of this stuff is coming from this event section, but it’s letting you look at it in a very visual way, to explore it in the context of how your infrastructure actually looks. In this view, it’s giving me that more traditional perspective across all of my events. What is kind of interesting about this is it does let me get pretty deep into various details of things going on, and it gives me access to a few more robust filters.

    Filter and prioritize

    One of the things I like this for is, if I am running a security organization and I have SOC concerns, or I’ve got particular MITRE concerns that I’m mapping to, or PCI compliance, whatever the thing might be, I can start adding filters for those different compliance specifications, and then I can go and look for live violations within those different frameworks. Right now, my nearest is from six hours ago, so I don’t have any from 10 minutes ago, but it lets me say, well, I need to assure my compliance with this particular specification. Let’s go look at my events holistically and see if I’m complying or not complying with a particular spec. So this lets me use that data in different ways and for different use cases for different teams.

    A lot of the other stuff that we’re doing here is to integrate into our customers’ lives in a more meaningful way. What I mean by that is too often when you’re dealing with a security organization as an end user or vice versa, you’re kind of talking a different dialect. You’re speaking the same language, right, but you’re using different words, or those words have different meanings than what you’re typically used to.

    One of the really good examples here is around, say, posture management. It’s a key area that we all have to care about. It’s all about configuration, you know, making sure that our stuff is set up in a way that is hardened and appropriate. And most of the time, it’s looking at specifications. Again, things like PCI, like NIST, best practices around Kubernetes, containers, Docker, whatever it might be. At the end of the day, you generally end up getting reports about misconfiguration.

    As a security professional, my job is to basically say, okay, I’m going to prioritize what I think are the most important things for my teams to go and fix. And then I’m going to give them this report. I’m going to go download it as a PDF, and I’m going to hand it to them over the wall and say go fix all these things. And they’re basically going to say, great, security gave me this gigantic list of things to do. I have to go look up every single misconfiguration. I have to go look up how am I going to fix that? What am I going to do?

    And so a lot of tools have said, okay, great, you know, we don’t want to inundate these folks with things that are erroneous or are difficult to figure out. So let’s go look at things like, you know, running as root. So they’ll give you little directions, and then the teams have something they can follow. But wouldn’t it be nice if your tools spoke the same language as your operations team? Or spoke the same language as your developers? What we’re trying to do is take it a step further and say, okay, we’ve got this specification, you know, we shouldn’t be running containers as root; we all know that.

    So, let’s go find all of our root containers across our environments. If I go highlight one of these, I can actually go and see the particular risks, the particular containers that are violating it. I can see why they’re running as root. In this case, no one set who the user should be, so it’s just defaulting to root. Let’s go fix that. We can go run this as 1001 or whatever value we use in our company. And when I do that, it actually generates a patch. Right? I can come in here, and, when I open up my Jira or whatever my ticketing system is, I can actually include a spec that they can apply directly to the cluster and go forward with. Or I could take it even a step further. If I want to, I can actually integrate directly into the repository, and I can open up a pull request directly in their workflow.

    The nice thing about this is, again, I’m not just giving a report to somebody saying go fix these things. The security professionals are saying, okay, here’s what the value should be. Go take this spec or, better yet, go review this PR and apply that to your cluster.
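
    To make the kind of remediation just described concrete, here is a minimal sketch of forcing a workload to run as a non-root user with a merge patch. The namespace shop, the Deployment name web, and the UID 1001 are placeholders rather than values from the demo, and the patch Sysdig actually generates may differ in shape.

```bash
# Hypothetical example: make a Deployment's pods run as a non-root user.
# "shop" and "web" are placeholder names; 1001 is an arbitrary non-root UID.
kubectl -n shop patch deployment web --type merge -p '
{
  "spec": {
    "template": {
      "spec": {
        "securityContext": {
          "runAsUser": 1001,
          "runAsNonRoot": true
        }
      }
    }
  }
}'
```

    The same change could just as well be committed to the workload manifest in the repository, which is what the pull-request flow described above amounts to.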

    Actionable events

    The goal here is that we’re trying to take all these conversations about misconfiguration, about events, about things that we’re seeing live in the environment, and make them actually actionable by the end users. Because the last thing we want to do is give people big, gigantic reports that they do nothing with.

    In the example of the runtime context, we saw all of these great, fun events up in our insights view. But what can I actually do with something like that? So, I have this event here. Let’s go find it.

    I believe the one I want to look at is under Kubernetes activity, and we’re going to go look for things that relate to the terminal. We’ve got this event here called terminal shell in container. This is something that we probably all have done in our lives. We’ve all gone into a container, popped open a shell, and made a configuration change. That’s probably fine in most cases. But in your production environment, you probably shouldn’t do it that way, right? So, that’s something you need to be uniquely aware of, watch for those types of events, and things like that. But what actually happened when those things went on?

    If I come looking into my events, I can see that I’ve got a bunch of things happening. Let’s go find one in particular. I can go look for this one here — we’ll just highlight it. The nice thing is that when you do find these events in our world, since the way we instrument is fairly deep into the nodes, it’s giving you a lot of supporting data around it in particular. In this case, I can see that someone logged in as a shell as root. They did a run command and pulled down some data with Bash. Basically did a bunch of stuff they shouldn’t do, right.

    This gives me the whole process view of it all. But I can start getting a lot of context as well around the process names, the parent process that executed it, arguments passed via the parent, PIDs, all sorts of other sundry metadata about what happened, who did it, and what the context was, even down to the cloud-level data or Kubernetes data.

    The reason it matters is that when I can provide a robust set of metadata around an event, I can make the event actionable. I’m able to say, hey, it was in this namespace, it was by this user, this pod, this whatever it might be. I need to go do something about this. I need to go take this and do more, right? It gives me a fair amount of data in an event that actually lets you do something with it. Again, it’s not about just throwing something over a wall; it’s trying to give you more supporting characteristics.

    The other one in here that is pretty relevant, I think, for you all, is what we are doing with Docker and how do we take all of this runtime data, all of this context around right now, and do something more meaningful with it? When we decided to start doing this with vulnerabilities, it was really about how do I take data again at runtime and do something novel with it. So I can look at all of my running clusters, and then I can go look for misconfiguration or, more importantly, vulnerabilities in those clusters in a context of like, how was it actually deployed?

    Security perspective

    The nice thing here is that this is very, very useful for a security professional or an operations person. Admittedly, though, this interface, this view of the world, is not the best use case for a developer, right? You’re not often going to have a dev come in and say, I want to go see live in the environment where this thing is running, in what namespace, and go look for vulns within that context. You might have a few who do, but that’s not where they live, right? That’s not a developer’s daily life.

    From our perspective at Sysdig, you know, that’s not the persona that we go after; that’s the persona someone like Docker Scout is built for, the persona other vendors go after. So we took all of this data that we had in here and made it interesting to view in different ways. I can go look for things like “has fix” or “has exploits” or “is in use, has fixes.”

    This basically lets me take a large number of vulnerabilities and filter it down to stuff that matters the most — again, the most right now. I can generate reports of this, and I can give it to folks to go work on, which is nice. It lets me basically prioritize my time better. It lets me focus on what matters most at this point in time.

    But wouldn’t it be great if I could take this “in use” data and plug it into a workflow that a developer actually cares about? If I could stick it in the terminal they’re working in where they’re doing their scans locally, or in an IDE of some sort, or in a Docker Scout plugin, you know, some sort of tooling that they have in their everyday life? What can I do with it, how can I leverage it there?

    That’s why we’ve partnered with Docker and others to take this data out of our platform and bring it directly into theirs, so that when your developer is doing the analysis they can say, oh, okay, I can see that in this container these are the packages that are actually in use by the container or the image that I’ve built. These are the ones I need to care about. Everything else I should be pulling out of that image. I shouldn’t have to deal with those anymore. I shouldn’t have to remediate those anymore. I should only care about these ones in particular. So, on the Sysdig side, this is kind of where it is, right? It’s all about this particular view, giving the operations folks access to it.

    Developer perspective

    Our friends here at Docker with Scout have taken this data and made it look much more interesting from a developer’s perspective. So from this point, I’m going to give it over to Christian and let him talk more about that, about the vision they have, the cycle it goes through, and how to integrate that into your everyday life.

    Thank you, Alex. Before we do that, I actually have a question for you. How do you guys do this? Like, what’s the technology behind detecting the packages in use?

    The technology we use for this is our agent that runs on the actual infrastructure itself. It looks at every single system call that traverses the kernel of every single node that we’re on, and it correlates those calls back to the libraries and packages that are making those calls themselves. It can figure out exactly what package is being used by the container and know exactly where it comes from within the container itself. And that works across all ecosystems, all languages.

    Integration

    Before we get into questions, I want to talk a little bit about how we built this integration and what you can do with it now. Sysdig has an API that we call into. The way you get started is to log into Docker Scout. The place to start is the integrations section. You scroll down and see a Sysdig tile here. In this particular case, the Sysdig tile is already configured, so you can manage it. I could say I want to add a new one, and then it takes you through a link to the documentation that explains how to configure the Sysdig agent and the Sysdig infrastructure to expose the runtime insights Alex introduced via an API. So you have to plug in an API token, select your cluster name and your environment, and then you’re off to a good start.

    Once you’ve done this, effectively two things happen. Let me switch over to a different organization here. There are really two things that we are collecting from Sysdig in order to help developers make better choices on where to focus their energy. This is as shown in the keynote. This view effectively lists all the images that someone has pushed in your organization. This is just my test account, so not a bazillion images.

    There is a drop-down up here that you can use to filter down by environment. Now, environments are something that Sysdig is able to communicate to us: these images are running in a particular cluster as a particular workload name. You can name and logically group these things. So you can say this cluster is production, this other cluster is staging, or you can keep the names as reported by Sysdig. Then you can filter down the list to make it really easy for developers and people interested in this kind of information to know what images are currently running in environment X. That’s already great value, I would say. Then we took it a step further, and this is where the in-use package information comes in.

    A look at VEX

    I want to take a step back and introduce you to a concept called VEX. VEX is the acronym for Vulnerability Exploitability eXchange. That is an emerging specification that allows providers of software to publish information about particular CVEs in the context of a particular package, subcomponents included, to say: I’m affected, I’m not affected, I’m investigating. So you can really specify to the consumers of your Docker images, or your, say, npm projects or something else: we confirm that we are affected. And then you can make inline statements about what you expect your customers, your consumers, to do. You can say I want you to upgrade to the next available version, and so on.

    In this case, there is a particular CVE I want to address in the context of a particular Docker image. Within that Docker image, I’m looking at one particular package. Scout supports VEX already. VEX statements can be uploaded into the database, helping you internally. And, if you want to effectively communicate that you’re not affected by a particular CVE, you could ignore this because there’s already a mitigation in place. Or you can say I am affected, and everyone needs to upgrade to the next available base image version, for example. We use this technology, this integration, to automatically create these VEX statements and then publish them into your Scout organization, which is effectively a mapping to your Docker Hub organization.

    The data that we are getting from Sysdig is essentially mapped into a positive and a negative VEX statement. For the packages in use, we can say, you know what, these packages are used. Sysdig told us that these are actually loaded at runtime and, with that, we say, okay, you are affected. For the other packages that are not in use, we say you are not affected, because these things aren’t loaded at runtime. And that is a VEX statement that is from that point on available to you.
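
    To make that mapping concrete, here is a minimal sketch of what a single "not affected" statement can look like in the OpenVEX format that Scout understands. The image reference, CVE identifier, author string, and file name below are made-up placeholders, and the statements the integration actually generates may carry additional fields.

```bash
# Hypothetical OpenVEX document marking one CVE as not affected because the
# vulnerable package is never loaded at runtime. All identifiers are placeholders.
cat > sysdig-runtime.vex.json <<'EOF'
{
  "@context": "https://openvex.dev/ns/v0.2.0",
  "@id": "https://example.com/vex/2023-0001",
  "author": "Docker/Sysdig integration (example)",
  "timestamp": "2023-11-04T00:00:00Z",
  "version": 1,
  "statements": [
    {
      "vulnerability": { "name": "CVE-2023-12345" },
      "products": [
        { "@id": "pkg:docker/example/app@1.0.0" }
      ],
      "status": "not_affected",
      "justification": "vulnerable_code_not_in_execute_path"
    }
  ]
}
EOF
```

    A package that is loaded at runtime would instead get a statement with "status": "affected", which is the strong signal discussed below.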

    VEX doesn’t specify, or make any strong assertions about, what you do with these statements. Do you have to accept them? This really boils down to trust, and where these statements are coming from. Do I have to trust them? Do I have to accept them? In the Sysdig case, when they tell you via our integration that a package is not in use, that is a very strong signal with a very strong meaning. Don’t trust random VEX statements off the internet, right, that someone has published to Docker Hub or any other registry. Make sure to validate where this information is coming from.

    In the context of official images, we are actively working with the official images team as part of Scout to deliver VEX statements for these images. In this particular case, of course, this is trustworthy, it’s part of your Docker Scout installation, and you should really look into this data.

    How does this look when you actually use this information? I can run a Scout command, and there are effectively two ways of looking at this. Let’s scroll up to the top. A package being in use is a really high signal amid the noise; you should really look at that vulnerability, right? So if we scroll down, this is one of our test images that we have deployed into a cluster where we have Sysdig configured. And, when you scroll down, you’ll see, okay, these CVEs you should really all look at, because the underlying package is loaded at runtime.

    This is what this “package in use” indicator is showing here. It’s coming from the Sysdig integration, and the recommendation is to update to the fixed version of the package for that CVE. Again, this is a VEX statement. It could come from other sources, but in this case, using the Sysdig integration, you get it straight out of the runtime information. Further down, you’ll see that a bunch of BusyBox CVEs are marked not affected, so it’s very easy to say, you know what, this is a shell, I don’t need that. I should probably remove it, as Alex said, or it’s fine for me to ignore this until I see a different indicator that I should really start looking at it.

    There are various filter options here, so I can take a quick look. I don’t actually remember all of them. There is a VEX author option, which is kind of a first attempt at letting you specify the VEX publishers that you, as a consumer, want to trust. In this case, I could put in the string Docker/Sysdig integration, and only the statements coming from Sysdig would be accepted. If I found a random VEX statement, it would still not apply, so you would still get the CVEs reported as you would expect. There is also a way of loading VEX from the local file system, and so on, but that really isn’t in scope for this talk.
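
    As a rough sketch of what that looks like from the command line, the commands below run docker scout cves against a placeholder image together with the author and local-file VEX options just mentioned; the exact flag names and behavior are best confirmed with docker scout cves --help for your CLI version.

```bash
# Placeholder image name; substitute one of your own images.
docker scout cves myorg/app:latest

# Only honor VEX statements from a specific author (assumed --vex-author flag).
docker scout cves --vex-author "Docker/Sysdig integration" myorg/app:latest

# Apply VEX documents from the local filesystem (assumed --vex-location flag),
# for example a directory containing the sysdig-runtime.vex.json sketched earlier.
docker scout cves --vex-location . myorg/app:latest
```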

    I think that’s pretty much the content that I had planned. Do you have any questions for us? No questions, everything’s clear. Who’s going to try this out? Show of hands. One two three, right.

    Conclusion

    Let’s see if we can find the link to the getting started page — that would probably be helpful. You can try Sysdig for free. There’s a form to fill out that puts you in touch with Sysdig. I hear it’s very low touch; you don’t get to talk to a salesperson directly, but you get a trial, and you take it from there. It’s all pretty easy to set up. It’s all containerized, so it’s pretty easy to go forward with, and then you play around with the product and use it in fun ways. Perfect. Thank you very much.

    Learn more

    This article contains the YouTube transcript of a presentation from DockerCon 2023. “Cut Through Vulnerability Noise with Runtime Insights” was presented by Christian Dupuis, Sr. Principal Engineer, Docker and Alex Lawrence, Principal Security Architect, Sysdig.
