Okay, hello everyone. My name is Mateusz. I work at Red Hat as a principal software engineer on the Kubernetes bare metal networking team. As the title of the talk says, we'll be talking about bare metal networking, and I wanted this talk to be a gentle intro into what you need to think about when you want to start doing Kubernetes on bare metal — the things Kubernetes doesn't tell you that you should care about. We'll see in a moment what that means. I work at Red Hat, I already said this, and I'm based in Switzerland. When I'm not doing computers, I'm doing farming. I'm actually much better at that, but it doesn't pay the bills, so I need to do the stuff I'm going to tell you about here. It is what it is. I don't do AI, as opposed to all the hype and that kind of stuff, so I'm not really on the hype wave. Bare metal was never really hyped, so what can I say?

Some intro on why we might even think about doing containers on bare metal. No one ever told us to, so what's the deal? HPC and AI — this slide predates the AI hype, so sorry for that, I could remove it — but long story short, there are some workloads that really benefit from running on bare metal. You may have some fancy GPU from, let's not name the company, or some network adapter where you really want direct access to the hardware. Or the other side of the scale: something you run that is critical to the infrastructure you already have, like network equipment. You don't want to run the router of your own data center as an instance in AWS, right? We shouldn't do it that way. Or something that is almost forgotten, until people call me with this use case: benchmarking. How do you benchmark hardware — CPUs and that kind of stuff — if not by running the workload directly on that hardware? Again, you don't want to create 50 VMs on some CPU only to get a benchmark of that CPU's performance. That would be chicken and egg. Let's not do that.

So now fast forward. We agree that we want to do Kubernetes, and we agree that we want to do it on bare metal. So we go to kubernetes.io, or whatever the site is today, we go to the docs, "Installing a cluster", and we start reading. What do I need to do to install a cluster? Is there any tooling that would help me install it? The very first page you see is "Installing Kubernetes with deployment tools", and it gives you kubeadm and some other tools. And we think: how lucky, there are tools that are going to do this for us. Okay, let's check the first one. You go to kubeadm and start reading: "Using kubeadm, you can create a minimum viable Kubernetes cluster." Okay — is a minimum viable cluster really the production cluster I'm going to run? Probably not. Let's skip that tool. The second one: kops. Let's go to the kops website and do the same — installing Kubernetes, getting started — and we start reading: deploying to AWS, to GCP, to DigitalOcean, yada yada yada. None of them is deploying to bare metal. Thank you very much, end of the story. Let's check the last one; maybe that's our chance. So we go to Kubespray. It's a set of Ansible playbooks — another story, you know — but okay, someone gives us a method to deploy Kubernetes on bare metal. So we go and run the Kubespray playbooks.
"With the bare metal infrastructure deployed, Kubespray can now install Kubernetes and set up the cluster." And you start reading those playbooks and you feel like: oh, this is so opinionated. Either I build my data center the way they want me to build it, or thank you very much, there is no tool. So let's agree that none of these three methods is for us; we need to do this ourselves. Let's build the thing brick by brick, from the beginning.

So what do we need to care about in a cluster — not only during the installation, but in general, to have the cluster bootstrapped and then working? First of all, this is bare metal: in the end you deploy this cluster because there will be some workload, right? You want to access that workload. You also want to access the API — basic operations. You don't deploy a cluster for the sake of deploying it and letting it consume energy. Then, of course, DNS infrastructure. You are deploying this in your data center — and then what? Are you going to tell your customers: now type this IP address slash something-something to look at the fancy website or application we deployed? No, you want a nice domain, but for that you need DNS infrastructure, and it doesn't come for free.

Next: we agreed we are doing bare metal because we have a reason to, not just because we don't like a simple VM from AWS, which means there will be some non-standard network configuration. It doesn't really matter whether it's fancy or not. It will be more than just "plug the server in and turn it on", because in most cases people doing bare metal don't have DHCP in all the networks, or they need a storage network, and it all requires fine tuning that doesn't come by default when you boot your Linux distro — plus some other dirty tricks that I'll tell you about later, because they are Kubernetes specific and I want to build my way up to them.

Then the cluster load balancer, because I told you that you need the API and ingress to your workload and all of that. This slide is overly complicated for two reasons. The first reason is that it is complicated. The other reason is that no one ever cared to make it less complicated. I know it sounds bad, but it is what it is. The only thing I want to tell you here is that we are in the story of building a cluster and installing it from scratch, which means we are bootstrapping from somewhere. You may be running those kubeadm commands to create the cluster from this laptop, right? So this laptop will be your initial bootstrapping infrastructure. On the other hand, at the other side of the room, I have the three servers that are going to be the masters. This all somehow has to be wired together. I need to have some IP address that will finally be the API when I spawn all those nodes in the cluster, so I need a virtual IP pointing toward that API. This is what I'm calling the API VIP, and it sounds complex, but in the end it boils down to one sentence: when you run kubectl commands, you need to target some IP address. If you are deploying bare metal infrastructure, you never want to target a specific node, because if that node goes down, all your tooling goes down.
So you want a virtual IP, and you may have a load balancer from a well-known company as an appliance, or you may want to just do it yourself with Keepalived — I'll show that in a second. And what is the other part of this slide? At some point we have deployed the control plane nodes and the worker nodes, and the API address should now point only to the control plane nodes, not to your bootstrap — the laptop goes away from this story. But then you have some other IP address, because you are deploying workloads. You are not only an admin now; you really have something that runs — your applications — and you don't want to expose your control plane to anyone, right? Or do you? Well, rather not. So you need another IP, and it's exactly the same story. Where do you take all those IPs from, and who manages them? Yeah, you manage them.

So what do we do for this? Of course, I'm telling you about a very opinionated way of designing a Kubernetes installation, and it's opinionated because we decided: let's do Keepalived in combination with HAProxy. I told you the story of why we need the VIP, so you should already be convinced that if we need one, then Keepalived, because it's very simple and proven in action. Why do we also put HAProxy in this story? Now it will be fast forward through some specific use cases and requirements that we got. The thing to remember is that it won't always be the same stack for API and ingress, because as an administrator of the cluster I usually have different requirements than the user — different tools, different purposes.

It's very easy to simply deploy Keepalived and tell it: pick this 1.2.3.4 IP and put it somewhere in this pool of servers, right? But Kubernetes is about being highly available. So what happens if one node goes down? Well, the IP address should float to some other node that works. But what does it mean, from the network perspective, that an IP address floats? What happens to the connections you have open to that IP address? We start asking ourselves these kinds of questions, because we now have three servers in the control plane, kube-apiserver runs on all three of them, we kill one kube-apiserver, and — unlucky us — it was the one holding the IP through which we access the cluster. What happens now? No access to the cluster. So either we wait for Keepalived to move the IP address, for ARP tables to propagate and all of that, or — and this is what we decided — we put HAProxy between the kube-apiserver and Keepalived. And this is something people from Kubernetes will want to kill me for: HAProxy is much more stable than the Kubernetes API. That's it. If you look at the statistics, kube-apiserver fails much, much more often than HAProxy, so this is our way to keep it simple. As simple as it sounds, the problem I want to solve is that when kube-apiserver dies, I don't want the IP address to float, because propagating ARP tables and expiring the caches takes too long, and I simply don't want to wait for that. So I put HAProxy there. The only thing to remember, if you really take this path, is that you need to fine-tune the health checks, because the worst thing you can do is let Keepalived notice an outage faster than HAProxy does — HAProxy also balances the traffic, right? Something like the sketch below.
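To make that concrete, here is a rough sketch of the two pieces. Everything in it is illustrative — interface names, addresses, ports and check intervals are made up, not taken from the talk. The idea is that Keepalived tracks HAProxy (not kube-apiserver), while HAProxy health-checks the kube-apiserver backends, so an apiserver failure is absorbed by HAProxy and never causes the VIP to move.

```
# keepalived.conf (sketch) -- the VIP follows HAProxy's health, not the apiserver's
vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"   # succeeds as long as the haproxy process is alive
    interval 2
    fall 3        # deliberately slower to react than HAProxy's own backend checks
    rise 2
}

vrrp_instance API_VIP {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.0.2.100/24        # the API VIP your kubeconfig points at
    }
    track_script {
        chk_haproxy
    }
}
```

And HAProxy in front of the apiservers, with checks that react faster than Keepalived's:

```
# haproxy.cfg (sketch) -- assumes net.ipv4.ip_nonlocal_bind=1 so that standby
# nodes can also bind the VIP before they actually hold it
frontend kube-api
    mode tcp
    bind 192.0.2.100:6443
    default_backend kube-api-backends

backend kube-api-backends
    mode tcp
    option httpchk GET /readyz
    http-check expect status 200
    default-server inter 1s fall 2 rise 3
    server master-0 10.0.10.10:6443 check check-ssl verify none
    server master-1 10.0.10.11:6443 check check-ssl verify none
    server master-2 10.0.10.12:6443 check check-ssl verify none
```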
So the order of events you want is: kube-apiserver dies — which shouldn't happen, but it happens — HAProxy notices that, and end of story. Keepalived should never notice it. Of course we can go deeper: what happens if HAProxy dies? Well, this is now a game of statistics. Has it ever happened to us that kube-apiserver and HAProxy died at the same time? It never has, unless you walk up to the server and pull it out of the rack. That's a corner case we don't try to cover, because it doesn't really happen in the wild. Of course there are some limitations: you can only have the IP address on a single node at a time, which is a disadvantage versus an appliance. The biggest problem is that you need all of this in one single L2 segment — one broadcast domain — because Keepalived doesn't work across subnets. We have ways to work around that by grouping nodes into different L2 segments and running separate Keepalived instances in each of them, but it's still a pain point and something you should really design well on paper before you start. But enough about load balancers — we could talk about this for ages.

DNS, because we said we want to do this DNS mumbo jumbo and we don't want to use only IP addresses. Of course, you are the administrator, you manage the infra. You could say: but we have this DNS infrastructure over there — maybe AWS, maybe Cloudflare, maybe something else — so we can just create records there. But then either you trust the user or you don't. And we don't. So another opinionated thing in our way of installing Kubernetes is that we spawn a very minimal CoreDNS setup which provides the DNS resolution you need to all the nodes of the cluster and all the pods running in it. So when you start the installation claiming that your API will be running on api.example.com, I don't worry about whether you already created that record on the external DNS: I just spawn a static pod running CoreDNS and create those records myself (a sketch of what such a configuration can look like is below). Whatever runs in this cluster will have them. This again protects me, because think about what happens if we decouple it. You have your external DNS, like most people. How do you want your cluster to behave when that DNS infrastructure goes down? Your data center is fine, but the DNS lives in some other data center, and that DNS is out. Do you want your cluster to start dying because pods want to talk to each other and cannot resolve DNS? It should all be self-contained, right? You don't want those external dependencies. So that's what we do.

The part I will mostly skip is that NetworkManager requires some tuning, because — for people who know how containers are spawned — when you start a container, a copy of /etc/resolv.conf is taken at the moment the container starts and is plugged into the container. Meaning that if you change your host's DNS configuration, it will not be propagated into the container unless you restart the container. For this reason we also hack around this file so that it really does get updated on the fly, but I don't want to go into that.
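As an illustration of that self-contained DNS idea, here is a minimal sketch of a Corefile that such a static CoreDNS pod could serve. The zone, names and addresses are made up for the example; the point is simply that records like the API and ingress names are answered locally on every node, and everything else is forwarded to whatever resolvers the host already uses.

```
# Corefile (sketch) -- served by a static CoreDNS pod on every node
example.com {
    hosts {
        # cluster-internal answers for the names the installer promised,
        # independent of any external DNS infrastructure
        192.0.2.100 api.example.com api-int.example.com
        192.0.2.101 apps.example.com
        fallthrough
    }
    forward . /etc/resolv.conf
}

. {
    # everything else goes to the upstream resolvers of the host
    forward . /etc/resolv.conf
    cache 30
}
```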
Something a bit more interesting, because now we are getting into Kubernetes APIs and how to extend them: network configuration of the host. This is a static configuration file for NetworkManager — you've probably seen one, and you've probably made mistakes in one more than once. The problem I want to state here is that this is a static file. You go, you modify it, nothing happens. You may notice a mistake in this file five years later, because for five years you haven't rebooted your server — and we don't want that scenario in the Kubernetes world. When you define some configuration, it should either apply immediately or fail immediately. Having to do manual modifications of a file breaks that contract. The other part is that it simply doesn't scale: if you have 300 servers in your bare metal cluster, you are not making those changes manually. Simply not. You have CRDs, and that's what should be happening.

Here's a very simple example. If I mistake a slash for a backslash, tools detect that — that's easy. But if I configure the default gateway with an IP address from outside of my subnet, that's utterly wrong, yet nothing in NetworkManager will prevent me from applying it. I simply don't want that. So we have a CRD that drives the host configuration from the Kubernetes API. It may sound like chicken and egg, but it's all a matter of how we order things: we define a Kubernetes CRD that describes how NetworkManager on the host should be configured — you can do it per node, and so on. Let me quickly show you how that works (a sketch of the kind of manifest used here follows below).

Here's the demo. I have this node, which has an IP address configured on its last interface, and what I want now is for that to be different. I want to change it — and I want to change it from Kubernetes, in a declarative way, so that whenever someone modifies it by hand, the change gets reverted. I just created a YAML that configures an IP address on some interface, as simple as that, and I will apply it with the hope that it works as expected. At the top we can see that the CRD now reports the configuration as progressing. And in fact it was as simple as that: we can see that the old IP was removed. For a moment I was expecting someone to ask: but you already had an IP from this subnet configured, so what's going to happen? Well, that configuration wouldn't fly, because you should not have two IPs from the same subnet on the same interface.

That's the short demo. At the same time, it's a Kubernetes API, so it should protect us from doing stupid things. I will try to configure a stupid DNS server that has no way of existing, because it's on the link-local IPv6 subnet. If I try to apply that, something should protect me, because it would actually break the configuration. Let's look at the configuration right now: we have 1.1.1.1 as the DNS server. Let's apply the manifest that configures the wrong DNS. The change gets applied — and it's wrong. At this moment your cluster starts to misbehave, your probes go down, and so on. Give it around 10 to 15 seconds and the configuration gets reverted, because there is a controller that checks whether your modifications to the host network, once applied, break something that should be working. In this scenario we see the state "Degraded: failed to configure" — it failed because that DNS server doesn't exist in reality.
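The talk doesn't name the CRD used in the demo, but the behavior shown (declarative desired state, per-node selection, health checking and automatic rollback) matches the kubernetes-nmstate NodeNetworkConfigurationPolicy API, so here is a sketch in that style. The node name, interface and addresses are invented for the example.

```yaml
# Sketch of a declarative host-network change; names and addresses are invented.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: worker-0-eth1-static-ip
spec:
  # Apply only to one node; drop the selector to target every node.
  nodeSelector:
    kubernetes.io/hostname: worker-0
  desiredState:
    interfaces:
      - name: eth1
        type: ethernet
        state: up
        ipv4:
          enabled: true
          dhcp: false
          address:
            - ip: 192.0.2.50
              prefix-length: 24
```

If applying the desired state leaves the node unable to reach the network as before, the controller rolls the change back and the policy ends up degraded — which is exactly the DNS scenario from the demo.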
That was just a short demo of how we handle all that. It's a bunch of self-contained pieces that, once you start using them together, give you a very nice Kubernetes installer that does it all for you — sometimes in an opinionated way, sometimes less so.

Now, I told you there would be some dirty tricks. In the kubelet there is the concept of the node IP, and here we are moving into the Linux world. When you want an application on Linux to run and interact with the network, it has to bind somewhere, and that somewhere is an IP address and a port. Let's forget the port; we're talking about the IP address. If you have multiple network interfaces, where should Kubernetes listen? Everywhere? On one IP address? On two? If you have ten interfaces, what do we do? I'd say Kubernetes upstream doesn't solve this in a very smart way, because it was designed to run on clouds with only one network interface, and as bare metal deployments grew, that's not an assumption we can keep. We developed some additional logic to handle this, and I will skip the details. In general, it's one more problem to think about: when you configure the kubelet manually, you need to think about what those IP addresses should be. The configuration is complicated, because you can say bind everywhere, or bind to one specific IP address, or you can pass something like "0.0.0.0" — essentially "IPv4" as a string — and what happens then? You get even stranger syntax: an IPv6 address as a string, a comma, and then an IPv4 address. You need to understand how all of this behaves and pick your choice. It's complex, and you can get really confused once you start. We have a set of rules, which I will skip — you can come back to this slide.

In general, there are corner cases. I just showed you an example where you shouldn't have multiple IP addresses in one subnet — but what if you do? Some people do that for a reason, so how do you want the kubelet to behave then? And one example I have that is just mind-blowing — it killed me for about two weeks: is your IPv6 address really an IPv6 address? This slide I'll skip. I got to this RFC which describes IPv4-compatible IPv6 addresses, and I thought: what the heck is that? Go to the libraries in all the well-known programming languages: every one of them has a function like "is this IP address an IPv6 address?". Look at the implementation — what does it look like? If the string contains a colon, return true. Thank you very much, game over. It's as simple as that. Really, for the last 30 years of my life I thought it was that simple, but it's not. Take this: colon, colon, then four times "f", then a colon, and then an IPv4 address with dots — ::ffff:<IPv4 address>. It is a correct address. There is an RFC for this address. It may look stupid, but it's a well-defined address, and it breaks things. Try opening a netcat socket to listen on this address: it will not work, because half of the tools now think it's an IPv6 address and half of the tools think it's an IPv4 address. I ran strace on it, and what I realized is that, based on this address, it was trying to open a socket on a plain IPv4 address. At that point, how should we treat it? This is a real-world scenario: I got it from a customer who was trying to install Kubernetes and wanted to use this subnet. I was like: what is that? Then we dug deeper and realized this is a monster. It should never have existed, but apparently it exists. If you can find a set of parameters to pass to netcat that makes it break, then something has gone wrong (the sketch below shows how differently standard tooling judges this same string).
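To make that ambiguity concrete, here is a small Go sketch — my own illustration, not from the talk — showing how the same IPv4-mapped string is classified by the naive "contains a colon" check versus Go's standard library.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strings"
)

func main() {
	// An IPv4-mapped IPv6 address (RFC 4291): IPv6 syntax, IPv4 semantics.
	s := "::ffff:192.0.2.1"

	// The naive check found in many codebases: "has a colon, so it's IPv6".
	fmt.Println("contains a colon:", strings.Contains(s, ":")) // true

	// The classic net package: To4() is non-nil, so many code paths
	// will quietly treat this as a plain IPv4 address when binding.
	ip := net.ParseIP(s)
	fmt.Println("net.ParseIP(...).To4() != nil:", ip.To4() != nil) // true

	// The newer net/netip package at least makes the ambiguity explicit.
	addr, err := netip.ParseAddr(s)
	if err != nil {
		panic(err)
	}
	fmt.Println("netip Is4In6():", addr.Is4In6()) // true
	fmt.Println("netip Unmap():", addr.Unmap())   // 192.0.2.1
}
```

Depending on which of those answers a given tool trusts, it ends up opening either an AF_INET or an AF_INET6 socket — which is exactly the half-and-half behavior described above.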
So, in the end: choose wisely what you want to do, and once you design your infrastructure, really double-check it with someone out there in the upstream community — is this really how you should be doing it? Because in a lot of cases you only realize that something misbehaves much later. And one more thing: you think everything is okay, you start deploying, and then something tells you: oh, sorry, but with this cloud provider you cannot use this syntax. And you realize: I wanted to do all of that, but I can't, because you're telling me I can't. You only find that out at the end of the story, after spending two weeks on the design. So, that's it.