DockerCon
Demystifying Kubernetes Platforms with Backstage
Matthew Clarke, Senior Engineer, Spotify
Transcript
This is “Demystifying Kubernetes Platforms with Backstage.” I’m Matt Clarke. I’m a senior engineer at Spotify. I’ve been working with Kubernetes for about seven years, starting before Spotify at the Financial Times. And I’ve been working with Backstage for four years, including before it was open sourced. I’m on the deployment infrastructure team at Spotify. I’m the Backstage project area maintainer for Kubernetes plugins, along with Jamie Klassen, who works at VMware and couldn’t be here.
Before we get started, I’m going to talk a bit about Spotify and Kubernetes. We’re responsible for what we call our multi-tenant Kubernetes clusters, which are about 40 clusters, all on Google Kubernetes Engine (GKE), that run all of our backend and web services. At our high-water mark of traffic, we have about 270,000 pods. So, these are pretty enormous Kubernetes clusters. We have a lot of other Kubernetes clusters at Spotify that do things like ML, data pipelines, that sort of thing. Deployments at Spotify are GitOps. Engineers push to prod about 3,000 times a day. And the majority of our services are on Kubernetes. Some of our services are on Helios, which is our legacy container orchestration system.
Platform engineering
This is also a talk about platform engineering. What does platform engineering mean? The idea is that we build tools for other developers at our organization in order to increase developer productivity and reduce toil. That means we need to treat our internal users as our customers. We’re very lucky because we get to sit beside our customers every day. There are people out there who have to go through panels or customer relationship managers just to talk to their customers and get their feedback.
As a platform engineer, you’re very lucky: you don’t have those hurdles. You can usually just Slack them. It’s great to have a direct line of communication. And since our internal users are our customers, we need to provide support for them and treat our tools like a product. This is mainly through Slack. And we’re going to come back to why that’s important a bit later.
First, you might be thinking: what is there to demystify about Kubernetes? I’ve been working with Kubernetes for five years, and it’s not very hard. Maybe that’s because we’ve been working with Kubernetes for x years. It becomes a bit easier the more you learn about it, but it takes a while. I think the problem is that Kubernetes has a steep and long learning curve. And it keeps going as you dig further and further into things.
As a platform engineer, it’s our job to provide tools and lessen the learning curve and therefore make our engineers more productive. This usually involves getting rid of toil, which is things that we think our users shouldn’t have to do, but they do sometimes. It’s also important to know who your users are. We find that our developers have different interests. You can’t really just lump them all as developers and treat them all the same. Some of our engineers are very interested in infrastructure tools. They dig into Kubernetes. They want to understand it deeply. Then they can utilize the platform and maybe not have as many questions as other engineers who might be hyper-focused on delivering features. They want to deploy for their users and maybe they’re less focused on infrastructure tools. Both are totally valid viewpoints, which is fine; people are different.
Journey to Kubernetes
I saw this image from Palark, and I thought it was a great representation of this steep learning curve for Kubernetes. It’s called the Kubernetes iceberg. You see at the top there, we have Docker. We’ve got pods. We’ve got deployments. We’ve got kubectl run. These are things that people might already know, or they’re pretty simple to pick up; you can pick up and run with them pretty easily. But then once you understand that slice, you go into the next slice, and you’re thinking, okay, what about ingresses and jobs and configuration? After that, you’re getting into node affinity, stateful workloads, persistent volumes, monitoring. And it just keeps going until you get down to the building blocks of Docker and Kubernetes itself. We try to push our users up this iceberg, and we take those lower iceberg things for ourselves. But that’s difficult. To understand how we even got here, let’s look at a very typical journey to Kubernetes.
What does that look like at any organization? What does it look like when you decide you’re going to use Kubernetes and want to roll it out throughout your organization? Typically, you think Kubernetes will help you solve some business goal, or maybe you just think it’s cool, which I do, but it also helps. You might want to use it for advanced scheduling or autoscaling, or you want to install some sort of Kubernetes-native tool. And you think, okay, this will be much easier if I just use Kubernetes; it’ll provide me with these nice features. And you tend to run it in an experimental way. Maybe one team runs their own services on Kubernetes. They monitor their Kubernetes deployments with the Kubernetes dashboard or kubectl. And they dig into everything that’s on those clusters. They own the clusters. They own all the services that run on those clusters. So the Kubernetes dashboard and kubectl might be quite suitable for this case.
The problem is, as you find that, oh, actually Kubernetes is quite good at scaling or scheduling or something else, then you start to get more adoption. And adoption starts to go well, and more services start to move to Kubernetes. Or a team says, well, they’re running these things on Kubernetes, and it’s going well for them, so maybe we’ll do that same thing. But we don’t want to maintain our old clusters. Let’s just use theirs. Things go well, and then you get into the more complicated cases where you think, oh, actually, one cluster isn’t really enough. Because we need to run local to regions in order to reduce latency. Or maybe some team needs their own cloud provider, some team wants their own cluster. Things start to get really complicated, and you end up having multiple Kubernetes clusters in production at your organization.
Then you hit this kind of crossroads where you think about what sort of platform engineering philosophy you’re going to have. Are we going to have multi-tenant clusters, where the platform engineers manage the central clusters and everyone just installs their stuff onto those (or we install it for them)? Or are we going to have single-tenant clusters, where the platform team manages the tools for the lifecycle of those clusters? I don’t think there’s necessarily a right or wrong answer. Like everything, it depends on your organization and what you want to do. We’re mainly going to talk about multi-tenant clusters, but a lot of what we’re going to show is still very important for single-tenant clusters.
The thing is, once you start to provide these Kubernetes platforms, developers start to need to view and interact with these Kubernetes resources. And, like we said before, maybe some developers know more about Kubernetes and some developers know less. And it becomes complicated. It becomes so complicated that you struggle to answer simple questions, like: which cluster is my service actually even running on? I forgot. Or you maintain this mental mapping of that service runs on this cluster, this one runs on that cluster. And someone new joins and you say, oh, these are just the things you need to know. The secret handshakes of, I know this service runs here, and this one runs there. And that gets really complicated, because one of our jobs, like we said, as platform engineers, is to make things easier for developers, not to make them more difficult and force developers to maintain this mapping of service and namespace to cluster.
So we started to get feedback when we rolled out these Kubernetes clusters in the very early stages, years ago: simple questions like, which cluster is my service on? And also, is it okay? Is it running in production all right? Is anything wrong? Do I need to do anything? So this is where Backstage comes in.
Backstage
If you haven’t heard about Backstage, it’s a platform for building developer portals. Engineers at Spotify use Backstage every day. They use it for viewing information about CI, viewing documentation, monitoring Kubernetes deployments, and much more. It kind of becomes the one place to go to manage all your services.
If you’ve seen any talks about Backstage or you’ve dug into it yourself, you know that one of the core features of Backstage is the software catalog, which stores information about your services, including who owns it, where the Git repo is, relationships to other services, lots of other things. One of the interesting things about the catalog is that it’s customizable and extensible. So you can actually build plugins or use open source plugins that are appropriate for your organization, including the one we’re going to talk about today, which is the Kubernetes plugin.
Answering those two questions we brought up earlier is what we’re going to do for the rest of the talk. And then we’re going to talk a little bit about how you can do the same.
We’ve heard a lot about building Docker containers and deploying them locally, and I went to some great talks about CI/CD. Now that we’ve deployed our containers, we start to provide functionality for our users, our developers, to handle monitoring and maintenance, those day-two tasks. This is where Backstage comes in. It becomes the interface to our services. Like we said, we maintain the software catalog of our services, and it integrates with those other tools. The user views their Kubernetes resources through Backstage. You can see an example of this here.
This is the open source Kubernetes plugin for Backstage. I have this service called dice-roller, whose responsibility is to roll dice for me. So I go into Backstage. I see the services I own. I go to dice-roller, and I click the tab for Kubernetes. It just brings me the information about where that service is running on Kubernetes: which cluster, the deployments, services, autoscalers, everything. You can see straight away it’s saying, you have 19 pods for this service, on this cluster. You’ve got these deployments, and this one deployment also has a horizontal autoscaler set up. And here’s some high-level information about some errors that Backstage has detected on these Kubernetes resources.
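For context on how the open source plugin knows that those resources belong to dice-roller: the catalog entity carries a Kubernetes ID annotation that the plugin matches against labeled workloads (we come back to the workload side in the Q&A). Here’s a minimal sketch of the catalog entry; the owner and other details are made up, and specifics can vary by plugin version.

```yaml
# catalog-info.yaml — a minimal sketch; names are hypothetical.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: dice-roller
  annotations:
    # The Kubernetes plugin uses this annotation to look up
    # matching resources across all configured clusters.
    backstage.io/kubernetes-id: dice-roller
spec:
  type: service
  lifecycle: production
  owner: team-dice
```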
One of the interesting things that we’ve done here is we’ve actually changed the way that the user was interacting with Kubernetes. Instead of going to the Kubernetes dashboard, as we said before, I’ll remember that this service is on this cluster. So I must remember to go to the right Kubernetes dashboard, or I won’t see my service. The user just goes to Backstage. They don’t actually have to worry about which cluster their service is deployed on. Backstage will find it for them, and just bring the information to the user in the context of the service for the user. And we’re going to explain what that actually means in a second.
An interesting thing about this is we’ve made the interaction the same everywhere. It doesn’t matter if it’s service one or service two: you go to Backstage, and Backstage will find those Kubernetes resources for the user. Like I said, we’ve changed something here. The user stays service oriented when they’re performing these tasks for the service that they care about. One of the problems is context switching while you’re working on a service. You think, okay, it’s gone through CI, and now I want to check that it’s deploying okay and everything’s fine: there are no errors, it’s not crashing.
The unfortunate thing is, if you just let them interact directly with the Kubernetes clusters, then they have to remember: oh, yeah, first of all, set my kubectl context to the right cluster, or open the right Kubernetes dashboard. We want to get rid of that intermediate step and say, you looked at the CI in Backstage, and now you just click on this tab, and you’ll see all the Kubernetes resources. There’s no step between those two things. They stay completely focused on the service that they care about.
Common questions
The other question we want to answer is: is my service healthy? That first example was pretty easy. This one is actually much more complicated, because there are probably a lot of people in this room who know that a lot of things can go wrong in Kubernetes, and explaining everything that can go wrong is difficult. And we had this problem: we get a lot of support questions on Slack, which reduces productivity. Developers ask us questions about their Kubernetes service, and then they have to wait for us to read it, investigate it, and answer, and there’s a queue time there that really affects developer productivity. They basically can’t do anything until we look at the service and get back to them.
So my colleague and I had an idea for a hack, which was, can we make it much easier to debug Kubernetes issues? We’ve already got developers going to Backstage to monitor their services as they’re rolling out. What if we just showed them the errors there in that same context? Then they wouldn’t have to dig too much into the specifics of their service or understand everything about Kubernetes.
The common question that we were getting was: my deployment failed, and I actually just don’t know why. It says “progress deadline exceeded”. Can you help me? You might know that there can be a lot of things that cause “progress deadline exceeded” when you’re deploying on Kubernetes.
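For reference, the “progress deadline” itself comes from the Deployment spec: if a rollout makes no progress within `progressDeadlineSeconds` (600 by default), Kubernetes marks the rollout as failed, regardless of the underlying cause. Here’s a minimal sketch, with illustrative names and values.

```yaml
# Sketch of where "progress deadline exceeded" comes from; values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dice-roller
spec:
  # If the rollout makes no progress for this many seconds, the Deployment's
  # Progressing condition becomes False with reason ProgressDeadlineExceeded.
  progressDeadlineSeconds: 600
  replicas: 3
  selector:
    matchLabels:
      app: dice-roller
  template:
    metadata:
      labels:
        app: dice-roller
    spec:
      containers:
        - name: dice-roller
          image: registry.example.com/dice-roller:1.2.3
```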
So how did we do this? I’ll start with what we had. This is the open source Kubernetes pod table; it’s basically a table of the pods that are running. It’s fine. I’m not super-pumped about it. But one of the things we noticed was that it was difficult to scan for errors. A table doesn’t give you a holistic view of all the pods that are running. You might have to go back or forward pages in order to see that, oh, something’s wrong with this pod and not with the other pods.
We wanted something that gave you a holistic view, something that was more visual, and something that you could look at and know straight away that something was wrong. You can actually see in this table, it looks fine. You’re like, oh, my service is running great. But actually, all the containers have restarted three times, and it’s not really highlighted anywhere. So you could totally just skip by that and think everything’s great. We wanted to make sure that didn’t happen anymore, and that users actually found those errors because we were highlighting them.
One of the problems, like I said, is that there are so many things that can go wrong. So we tried to be a bit data-driven about this. We went to kube-state-metrics, which exposes the reasons that pods terminate or fail to start. We thought, even if we can’t cover everything that goes wrong, we can cover a large majority by going through the reasons those metrics expose and figuring out how to highlight them for users.
Our strategy here was: think about how we would debug these errors ourselves, and see how we could automate that process. We broke the errors we saw into three categories, based on whether you’d check the manifest, the logs, or the Kubernetes events. The kind of easier errors that we saw were (there’s a small illustrative manifest after this list):
- InvalidImageName — Maybe you put in the wrong Docker image name, one that doesn’t exist.
- ImagePullBackOff — Maybe you don’t have permission to actually pull the image that you’re saying that you want to run.
- OOMKilled — Which can be a service issue where there’s a memory leak, but usually we found it was just that you didn’t provision enough memory for the service.
- CreateContainerConfigError — Which is usually where you’re trying to use a secret or a config map and it doesn’t exist.
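As an illustration of two of these, here’s a sketch of a pod spec that would hit them; the image names, resource values, and ConfigMap name are made up.

```yaml
# Illustrative pod spec producing two of the errors above; values are made up.
apiVersion: v1
kind: Pod
metadata:
  name: error-examples
spec:
  containers:
    - name: oomkilled-example
      image: registry.example.com/dice-roller:1.2.3
      resources:
        limits:
          memory: "16Mi"   # far too small for the workload -> container gets OOMKilled
    - name: config-error-example
      image: registry.example.com/dice-roller:1.2.3
      envFrom:
        - configMapRef:
            name: does-not-exist   # missing ConfigMap -> CreateContainerConfigError
```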
Those are the kind of infrastructure errors that we saw, but then we noticed that those actually had pretty good error messages from Kubernetes. And we were really wondering: why are the error messages from Kubernetes for CrashLoopBackOff, Error, and readiness probe failures not as good?
That’s when it kind of clicked for us; it seems pretty obvious looking back. Error, CrashLoopBackOff, and readiness probe failures are the errors that cross the boundary from infrastructure tools into the runtime of the service. You basically can’t know what’s going wrong without integrating tracing or having some sort of great monitoring setup.
When a user comes to us and says, my pod has crashed and it says error, what’s going wrong? You know, I’m an infrastructure engineer, I actually don’t know what their service does usually. So I’m like, well, I kind of need you to tell me what’s going wrong here, because I don’t know why it crashed either. So, you know, you dig into it, you maybe say, oh, maybe these logs help.
These are the errors that, while you can’t give the user the reason that this error has occurred, you can get them very close. And you can get them much closer than we were before by, for example, showing the logs, showing Kubernetes events, and at least giving the user as much context as you can about why these errors are happening. One other thing was that we wanted to link to the correct docs at the correct moment. So the user didn’t feel like we had shown them something and they still didn’t understand what we were talking about. We always wanted them to have a link to somewhere else that could explain the problem in greater depth, or they could always come back to our Slack channel. That was our nice safety net.
Displaying errors to users
This is what we ended up with as our first iteration. You can see it’s a pretty big departure from our table of pods. This is the first iteration in the internal Spotify Backstage instance; it’s not in the open source version at the moment. We’re going to talk about that in a little bit.
Even if you don’t know anything about Kubernetes, maybe you only know that you run containers in pods, and that’s how your service runs in production. You can pretty much tell straight away what’s going on here, because you know that you have seven pods. Three of them look good, I guess; they’re green. Four of them look not so good, because there’s a big warning message. And one of them looks twice as not so good, because it says two, and the other ones say one.
We wanted it to be pretty obvious what was going on, even if the user didn’t know a lot about Kubernetes, just by having visual representations of what is happening. It can also help the user actually learn about Kubernetes through Backstage, which is one of the things we’re trying to achieve. And just so there was no confusion about what these errors were, what the numbers meant, or what the warnings were, we also have an aggregated list of all the errors the user is seeing on the right-hand side.
Let’s look at an error. This is a Kubernetes Error, which we can explain to the user. We say: this is the error we’ve detected. My container, mclarke error tester, has restarted two times, and this is what Kubernetes calls an Error. And then we explain what that actually means: the container exited with a non-zero exit code, and that exit code was two.
To go a step further than just reporting what the error was (one of my pet peeves is error reporting that then doesn’t tell you what to do when something goes wrong), we say: maybe you should check the crash logs to see if there are any stack traces, which would tell you why it exited with error code two. And by the way, here are the logs of the service right before it crashed. We just give the user all the information they need to try to figure out what’s going on. They don’t have to go hunting for it, which is really what we were trying to accomplish. So this is an error that crosses the boundary from infrastructure into the service.
A much simpler error, where you can probably immediately tell what’s gone wrong, is InvalidImageName: we failed to pull the image “$placeholder” because it’s not a valid image name. You can probably tell exactly what’s gone wrong: we forgot to replace the placeholder. So we say what this means (the image name is not valid), and here are some links to what a valid image name looks like.
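As a sketch of how this typically happens (the manifest and variable name are hypothetical): a templated manifest gets applied without the image variable ever being substituted, so the kubelet reports InvalidImageName.

```yaml
# Hypothetical pod spec where a templated image variable was never substituted.
apiVersion: v1
kind: Pod
metadata:
  name: dice-roller
spec:
  containers:
    - name: dice-roller
      image: $IMAGE   # "$IMAGE" was never replaced, so it isn't a valid image reference
```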
As we released this, we were worried: would people actually even like it? People kind of liked the table we had before. So we released it under a feature flag internally, and we talked to some of our other Kubernetes admins and said: when you’re debugging someone else’s issue, when they’ve come to us with a support question, instead of using kubectl or a dashboard, switch this feature flag on, go to the Kubernetes page in Backstage, and see whether this actually helps you. Can you even make sense of this? Hopefully, if you can, then they can. And we started to get some really great feedback from our Kubernetes admins that this was actually really useful; maybe we should just make it the default view for internal users.
In April, we did that. And we immediately saw this huge increase, nearly a 100x increase, in interactions with the Kubernetes plugin. Before that, what we think was happening was that people would go in, and it would say something’s wrong, yes or no, but it wouldn’t tell them what exactly was going wrong. Then they would go straight out to kubectl, and they wouldn’t interact with the plugin any further. But this kept the user, like we said, focused on their service in Backstage. They didn’t have to context switch. And they got immediate error reporting about what was going on. They didn’t even have to hunt for it.
One of the things I said at the start of the talk was about talking to users and getting their feedback, which is really important. You’ll notice the difference between this view and the previous view I showed you: we have some nice new features up on the top right. This is one of the important parts of platform engineering, getting feedback from your users. We released the first iteration, and we were like, this is great, what more could anyone want? Then users came to us, and they told us exactly what more they wanted. And they were totally right to do that, because we had missed some things that users thought were crucial, which is easy to do if you don’t talk to your users.
Additional features
These are all features that we added through feedback from users. We talked about those two different types of users. Some users are really into infrastructure; they understand Kubernetes and Docker deeply, and maybe they don’t even need our help to find these errors. So for them we decided: what if we just give you this little connect button that gives you the kubectl command you need to run to set your context and namespace? We can get you there a bit faster, at least, and then we’ll get out of your way.
Then users at the other side of the spectrum, they didn’t want to have to click through each pod to see the crash logs. So we gave them an aggregated view of crash logs that they could see all at the same time. And then because Backstage is not a log aggregation platform, we’ve also got a link to an actual log aggregation platform in case you need to do some deeper dives into what has been happening historically.
What’s probably my favorite feature is the “Group by” option, which lets you group the pods that you’re seeing in different ways. You can see in this view, we’re actually grouping by production. One of the things we got from our users was: we had this incident, it was in this region, and it was difficult to tell it was in this region because all the pods are aggregated by environment. So we said, what if we give you the option to group these icons not by environment, but by region, or cluster, or commit ID? That allowed the user to infer where these errors were happening based on the different values that they were grouped by. They could immediately say, oh, actually, all the pods in these clusters are okay. It’s just this one; maybe something’s up in this cluster, or maybe something’s up in this environment or this commit ID that we’re rolling out. We need to revert. So yeah, it’s super important to get feedback. Like I said, my favorite feature, we did not come up with. It was given to us by our users.
Now, because this talk is not called how we demystified “our” Kubernetes platform with Backstage, let’s talk about how you can do the same. In terms of building your own platform, one of the difficult things with implementing a Kubernetes plugin for Backstage was keeping it generic enough that other organizations could use it while still being actually useful. The reason for this is that my Kubernetes setup isn’t like your Kubernetes setup. Maybe you have a different authorization mechanism. Maybe you have more clusters, maybe you have fewer clusters. You use different cloud providers. You install different things on your clusters. You have a different multi-tenancy setup. It really depends. So one of the hard parts was forming the correct abstractions for this plugin so that we weren’t just releasing “the Kubernetes plugin, if you do everything the way Spotify does.”
In practice
This is from the original RFC in 2020 about open sourcing the Kubernetes plugin. We came up with three abstractions: basically, three questions that, if you can answer them, mean you can run this. The first is: how are you going to authenticate against the cluster? Are you going to let your users use their client-side identity, for example their Google account, and proxy that to the cluster, so all the requests come from that user? Or are you going to have a server-side service account, maybe keeping everything read-only, where everyone can see everything? That’s also fine. Then there’s the cluster provider, which is basically service discovery for clusters.
This is you telling Backstage how it’s actually going to find the clusters in order to interact with them. You can do this through config. You can also do it by just pointing it at an AWS EKS cluster. You can point it towards a GCP project, and it will just scan all the clusters and start talking to them. You can also ingest a Kubernetes cluster into the Backstage software catalog and use that as the source of information, which is pretty meta, but it’s a super interesting way to do it.
Then there’s the service locator. It’s probably the thing that is most dependent on your multi-tenancy setup, and it basically answers: which cluster is my service running on? The default, which works for smaller numbers of clusters, is: I’m just going to look at every cluster, and when I find something I’ll report it, and when I don’t, I’ll say nothing. But you can also add implementations to support a mapping of services to clusters. We’ve got a bunch of other ideas about how maybe we can use the software catalog in order to do this in a pretty hands-free way.
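To make those three abstractions concrete, here’s a minimal sketch of what the open source plugin’s app-config can look like. The cluster name, URL, token variable, and GCP project are hypothetical, and exact keys can vary across plugin versions.

```yaml
# app-config.yaml — a minimal sketch; cluster details are hypothetical.
kubernetes:
  # Service locator: how Backstage decides which clusters to ask about a service.
  # 'multiTenant' asks every configured cluster and reports whatever it finds.
  serviceLocatorMethod:
    type: 'multiTenant'
  # Cluster provider: how Backstage discovers the clusters themselves.
  clusterLocatorMethods:
    # Static config: list clusters explicitly, each with its auth mechanism.
    - type: 'config'
      clusters:
        - name: prod-us-east
          url: https://1.2.3.4
          authProvider: 'serviceAccount'          # server-side, read-only token
          serviceAccountToken: ${K8S_PROD_SA_TOKEN}
          skipTLSVerify: false
    # Auto-discovery: scan a GCP project and talk to every GKE cluster it finds.
    - type: 'gke'
      projectId: 'my-gcp-project'
      region: 'us-east1'
```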
Lessons learned
So, lessons from all this. I think one of the big ones is talking to users and asking for their feedback. Like I said, my favorite feature was not my idea, which was actually great; it’s a really great feature, used every day, very useful. Also, treat your platform like a product. Just because it’s an internal tool doesn’t mean you just throw it out there and say: you can use this or not, good luck. You have to actually provide support, update it, and take feedback on how you’re going to support these tools going forward. It’s also a good idea to shift interactions to be service-oriented rather than infrastructure-oriented. I guess it should say mostly, because we still have those users who are interested in digging into infrastructure problems, but we have to recognize that not everyone is like that.
The other thing is automating your debugging processes by thinking, okay, how would I solve this problem generically, and then maybe I can automate it so that I can solve this problem for all of my users.
So the question you might be thinking is: when is this new error reporting UI coming to the open source plugin? It’s actually coming very soon. We’re in the process of open sourcing it, but I wanted to give everyone an opportunity to give feedback on the UI or to make suggestions before we open source it. You can see it at that GitHub issue on the Backstage repo.
Feel free to just give your feedback. And if you’re thinking, how can I help? You can integrate Backstage against your Kubernetes setup. You can talk to me on the Backstage Discord. You can contribute to the Backstage Kubernetes plugins. I would say that not all contributions need to be, or even should be, code. You can contribute tests, docs, design, product strategy. There are a bunch of different ways that people can contribute.
I’ll just finish by asking you to come help me make developers more productive on Kubernetes. So thanks very much.
Q&A
We’ve got some time here if anyone wants to ask any questions.
You talked briefly about how Backstage integrates with providers; I guess you can write your own plugins and such if the providers that ship with it don’t work. Have you found that the community has been pretty active around adding additional providers to the ones that maybe you initially created at Spotify? And in particular, have you observed any folks using Rancher as an intermediate layer, for auth in particular, so that Backstage users could authenticate against Rancher and therefore get access to their stuff in Rancher projects, which is an extra layer of abstraction? I’m wondering if maybe some of the community has already done some of that work.
Yeah, that’s a great question. It’s a great point. Have we found the community has been active, contributing different implementations for the different cloud providers? The answer is yes. I actually only implemented the Google one, because I didn’t want to go and implement the AWS one or the AKS one as if I were the expert there; that’s not what I was doing every day. I wanted to give someone else the opportunity to contribute those. And straight away, the community provided implementations for EKS, for bare metal, for AKS. And I think Rancher is there as well. There have definitely been some contributions towards linking components straight to the Rancher dashboard, I think. I haven’t used Rancher myself, but I think it is definitely there. Cool.
Also, could you double-click a little bit on what you keep calling a Kubernetes plugin? I mean, is this just like an agent you deploy in a Kubernetes cluster, a piece of independent software? It looks more like it’s embedded into your Kubernetes clusters. I’m curious if you could talk a little bit more about that.
Yeah, that’s a great question. It’s actually not something that you install onto your Kubernetes cluster at all. The Kubernetes plugin for Backstage is made up of a couple of different components. There’s the frontend component, and there’s also a backend, which acts like a proxy. It discovers those different Kubernetes clusters. It maps the authorization that you’ve specified for those clusters to how requests are actually sent to them. It’s basically a proxy that does a lot of the heavy lifting of talking to those Kubernetes clusters.
But, like I said, you don’t actually have to install anything on your Kubernetes clusters to get this to work. The only thing that’s necessary is to tag the resources with the Backstage component they relate to, to make them easier to find. There are multiple different ways you can do that, and there’s lots of information in the docs about how to do it and how to debug it.
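As a sketch of what that tagging can look like (the service name is hypothetical, and the docs describe alternatives such as custom label selectors): the workloads carry a label matching the component’s `backstage.io/kubernetes-id` annotation, so the plugin can find them across clusters.

```yaml
# Sketch of labeling a workload so the Backstage Kubernetes plugin can find it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dice-roller
  labels:
    backstage.io/kubernetes-id: dice-roller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dice-roller
  template:
    metadata:
      labels:
        app: dice-roller
        # Label the pod template too, so pods (and their errors)
        # can be matched back to the Backstage component.
        backstage.io/kubernetes-id: dice-roller
    spec:
      containers:
        - name: dice-roller
          image: registry.example.com/dice-roller:1.2.3
```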
OK, so when you call it a plugin, the plugin is not a Kubernetes plugin? Yeah, exactly. It’s a Kubernetes plugin for Backstage, not a plugin for Kubernetes. Awesome. Thanks very much.