Okay, this is going to be interesting. We are relying on the Wi-Fi a bit here as well, so it would actually help if you turned off your Wi-Fi. I know that's a big ask; consider it for the next half an hour. That would be really helpful. So Vanessa is live here through a video call. Give us a wave, Vanessa. Can you try speaking? "What's up, folks?" Sorry, that's not working. Is it working now? Try again? "Still, what's up, everyone?" Okay, that's much better. Nice. So we'll start your recording, Vanessa, and then we'll try to do live Q&A at the end. Sounds good. I have some answers for the previous Q&A, too, so we can talk a little bit about that. We can try. We can try. By the way, Vanessa is also the one who designed the HPC social logo, so you should thank her for that and take some stickers when you leave. Thank you. Thank you. All right, here comes the talk.

Hi, folks. I'm Vanessa Sochat, and today we're going to be talking about Kubernetes and HPC, the bare metal bros. I thought I would open this talk by putting two words on the slide, and then I'll get to the question that many of you have probably been asking, maybe a little anxiously. Those words are cloud and HPC. The question on everyone's mind is: what does the future look like? I'm going to answer that by posing a question back to you: where is the money going? We can look at reports from Gartner and Hyperion Research suggesting that cloud is projected to reach about $40 billion by 2026, while traditional HPC grows at a smaller CAGR of 6.4 percent. So, very superficially speaking, the money is going to cloud.

Now we can follow up on this question: okay, that's great, but who's going to get left behind? We can look at a paper from Reed, Gannon, and Dongarra from 2023 that identified some really interesting trends. For HPC, it suggested that the way we design and deploy our systems is not going to continue to work. We can no longer depend on Dennard scaling and Moore's law, and there are rising costs for improved semiconductors. This is going to make it harder, more expensive, and more laborious to deploy new systems. They describe non-recurring engineering (NRE) costs that we incur for every new system. Cloud, on the other hand, is leading the space in innovation. As we know, there's a massive expansion of large-scale commercial clouds. They are not depending on software vendors or hardware vendors; they're making their own stuff in-house. And guess what? They're hiring away and attracting the talent pool.

They made a really interesting analogy with temperature. They described HPC as endothermic, requiring the absorption of heat for survival, and cloud as exothermic, giving off heat. And we know, folks, we're not really talking about heat here; we are talking about money. But to continue the heat analogy: if you've ever been out in the snow, in a cold environment, you know you're much better off being the one that can give off heat to survive. So who gets left behind? The one that needs to constantly absorb heat, because that heat is probably going to run out. And that's the reason we're all here: we need to ensure that the needs of our science are represented in this new environment. And guess what? The success of our science really depends on our ability to be collaborative in this space.
And so this is really the manifesto of Converged Computing: if we bring them together, we get a new technology space with the best of both worlds. So where do we start? Here is how the talk is going to proceed today. We're going to start with models for convergence, talking about patterns for bringing together traditionally disparate environments. We're then going to move into strategies for convergence: designs that I've noticed allow for easy movement between the spaces.

Let's start with those models for convergence. If you've looked in paper land, you've probably seen many different models; there are many different ways to take HPC and cloud and put them together. I'm going to talk about the high-level patterns from the perspective of someone who is maybe deploying a system. Let's say that's me, and let's say I want cloud and HPC. I'm going to take my limited set of resources and try to split them in two. So I spend a ton of money, I do this, and then... I chose poorly. No one's using half my resources. Oh my god. So four years later I come back and I'm like, all right, I want cloud XOR HPC, exclusive or. I understand I can't have my cake and eat it too, so I'm just going to choose one. We've used HPC all these years, it's our bread and butter, it's the way we've always done things: I choose HPC. Great. Six months later, someone comes into my office: are we dinosaurs? You know, everyone over there is using YAML and automation, and we have this old setup, and... ah. So you go back to your office, you contemplate your life choices, and you're like, all right, no, it's okay, I'm not going to wait another four years, I'm going to sneak it in. This is where you see all of these ideas like bursting and multi-cluster, which generally refer to having some home base of resources and reaching out to get more. The problem with this approach, as I see it, is that the complexity of these approaches often reflects the complexity of the systems. They tend to be snowflakes, they tend to be complex, and this is why there hasn't been a single leader that has emerged in the space.

So here is a different idea that's less common, because it doesn't superficially make sense: I want cloud AND HPC, meaning I want to be able to run HPC, or cloud, or both at the same time, or something more converged. What the heck am I talking about? Don't worry, we'll talk about it. Let's first talk about strategies for convergence. These strategies, I need to point out, are not just about the technology; they are also about the people, which is often harder. The first is common goals. In order to get two different communities working together, they have to care about the same things; you can't get around that. The second is modularity: the degree to which your application or infrastructure is modular means you can use pieces interchangeably, swap them, and be very creative. The third is integration: the consumption of one entire thing inside another thing, by way of different strategies.

So let me give you some examples. For goals, the best overlap of goals I've seen is with respect to batch workloads. A few years ago, the Kubernetes community started the batch working group, and this was because of the new need to run AI and ML workloads in Kubernetes. Traditionally, Kubernetes is where you run services; you keep something running. There wasn't this concept of starting something and having it complete, but all of a sudden there was this new need, and guess what? We have been doing that in HPC land for a couple of decades now.
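To make that batch idea concrete, here is a minimal sketch, not from the talk, of what a run-to-completion batch Job looks like through the official Kubernetes Python client. The job name, namespace, image, and command are placeholder assumptions.

```python
# Minimal sketch (not from the talk): a run-to-completion batch Job submitted
# with the official Kubernetes Python client. Name, image, and command are
# hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="demo-batch-job"),
    spec=client.V1JobSpec(
        completions=1,      # run once to completion, like an HPC batch job
        backoff_limit=0,    # do not retry on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="app",
                        image="busybox:latest",
                        command=["sh", "-c", "echo running && sleep 10 && echo done"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```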
Modularity: a really great example is actually Kubernetes and Flux Framework. You may think of Flux as just this workload manager, but it's actually called a framework because we assemble many different components together into the workload manager known as Flux. Kubernetes is the same, a different set of components, and there are creative ways that we can use these pieces interchangeably. For the final example, integration, the best technologies I can point to are containers and language bindings. Container technologies are literally a vehicle to let you move between spaces, and language bindings let you take a traditionally C++ HPC project and extend it into a language that is native to cloud, for example, Go.

Alrighty, let's get into some examples, just like eggs three ways. Here are some projects that we've actually been working on at the lab. The first is Fluence. As I alluded to, this is the Flux scheduler swapped in for the default Kubernetes scheduler. The next is the Flux Operator, the entirety of Flux Framework implemented inside of Kubernetes. And then the namesake of this talk, the bare metal bros: Flux and Kubernetes working side by side.

So let's start with the Flux scheduler within Kubernetes. You may be familiar with what happens in Kubernetes when you launch a job: you ask for a certain number of resources, that's given to the scheduler, and the scheduler says, okay, here are four pods, have a nice day. What we're going to do is bring in Fluence: our C++ package, flux-sched, wrapped with Go bindings into a custom scheduler plugin. We're going to swap it in. You're basically going to ask for the same amount of resources, but the scheduling is going to be done by flux-sched. How does this do? Well, we find that the workflows run three times faster. What you're seeing here is kube-scheduler on the top, Fluence on the bottom. You see a lot of randomness with respect to how kube-scheduler places jobs, and what this leads to is a pathological scheduling pattern. Anywhere you see a red box, that is a startup delay. What that means in practice is that although the workloads themselves run in similar times, we have a lot of outliers, a lot of jobs that take a really long time to get started. Fluence improves upon that. So Fluence is a really great example of modularity, because we're taking an HPC technology and literally swapping it in, and the modularity of the software allows for that. It's also a great example of integration: because we have those Go bindings, we can speak the language of the cloud native communities.
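From the user's side, swapping schedulers like this mostly amounts to pointing a pod at the alternate scheduler by name. Here is a rough, hedged sketch with the Kubernetes Python client; the scheduler name "fluence" and the label are assumptions for illustration, not exact Fluence configuration, and the plugin would need to be deployed on the cluster first.

```python
# Rough sketch: from the user's point of view, a custom scheduler plugin is
# selected by setting scheduler_name on the pod spec. The scheduler name and
# label below are illustrative assumptions, not exact Fluence configuration.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(
        name="custom-scheduled-pod",
        labels={"app": "demo"},  # hypothetical grouping label
    ),
    spec=client.V1PodSpec(
        scheduler_name="fluence",  # hand placement decisions to the custom scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="app",
                image="busybox:latest",
                command=["sh", "-c", "echo scheduled && sleep 5"],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```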
Alrighty, next project: the Flux Operator. Super cool. All the gophers in Flux-land are pretty cool. The Flux Operator implements the entirety of Flux Framework inside of Kubernetes: your own HPC cluster. This happens by way of a custom resource definition, or CRD, where you basically give all the parameters that you want for your cluster, whether that's a single job or an interactive cluster. This creates what we call the MiniCluster, and Flux doesn't know the difference between running in Kubernetes and running on bare metal. There's a lead broker connected to several follower brokers, so here you have one pod for one physical node, connected by the tree-based overlay network, and within each pod, or node, you have Flux added on the fly to your application. The Operator is just going to reconcile until the state that you asked for matches the actual state of the cluster. How well does it do? We compared it to the best in the space last year, the MPI Operator, and the Flux Operator consistently outperformed it, we believe because of the ZeroMQ bootstrap. So the Flux Operator is a beautiful example of integration, because we're taking the entirety of Flux Framework and implementing it inside of Kubernetes.
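Here is a rough sketch of what creating a MiniCluster can look like from Python through the Kubernetes custom objects API. The group and kind are the Flux Operator's, but treat the API version, namespace, image, and spec fields below as approximations; check the Flux Operator documentation for the authoritative schema.

```python
# Rough, hedged sketch: creating a Flux Operator MiniCluster via the Kubernetes
# custom objects API. The apiVersion, namespace, image, and spec fields are
# approximations; consult the Flux Operator docs for the real schema.
from kubernetes import client, config

config.load_kube_config()

minicluster = {
    "apiVersion": "flux-framework.org/v1alpha2",  # version may differ per release
    "kind": "MiniCluster",
    "metadata": {"name": "lammps-demo", "namespace": "flux-operator"},
    "spec": {
        "size": 4,  # one pod per "node": a lead broker plus follower brokers
        "containers": [
            {
                "image": "ghcr.io/example/lammps:latest",  # placeholder image
                "command": "lmp -in in.lammps",            # placeholder command
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="flux-framework.org",
    version="v1alpha2",
    namespace="flux-operator",
    plural="miniclusters",
    body=minicluster,
)
```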
Bro, bro, bro, is it time for the bare metal bros? Yeah! Okay, so, warning: I've been saying bare metal, but nobody's going to give me bare metal, let's be frank about that. We're using virtual machines as a proxy for bare metal, so just a warning. So what's different about this picture? The orange is on the outside. We actually have Flux Framework on the outside spinning up a Kubernetes cluster, and notice that we still have compute running on "bare metal" alongside Kubernetes. How is that possible? Don't worry, I'll tell you. So why do we need this in the first place? As you know, there are increasingly more complex, heterogeneous workloads coming to HPC. That means not just embarrassingly parallel stuff, but also adding in services, databases, task queues. Ah! Okay, so this slide is not wrong. I was going to give you an example of such a workload, and apparently this slide is warning you that I'm a bad scientist. I'm not, and I will point out that my example is actually a very good prototype for this kind of design. Let's talk about that. So let's say that we're running simulations, training examples one through N, whatever, it doesn't matter, and we want to send them to a machine learning server, a specific endpoint, to do the training. We then want to wait until some metric of goodness, or perhaps a number of samples, and then we want to flip it around: we run simulations again, but we give the data to our machine learning server without the actual values, and then we have a vector of true values and predictions and we can see how well we did. Now, very superficially, if we map this to HPC versus Kubernetes, we would expect that the simulations run better on bare metal, and the service side runs better in Usernetes, or Kubernetes. But that's just what we'd expect; we need to prove it to ourselves first.

So a lot of you are probably out there like, user... what? Kubernetes in user space? Are you nuts? I'm not nuts. There's actually something called Usernetes. It came out of a Kubernetes Enhancement Proposal, or KEP, in 2022, from a very talented developer named Akihiro Suda. Akihiro, I must point out, won the Top Maintainer award at KubeCon last year. He's an incredibly talented developer; if you've used any of these rootless technologies, he's the one behind them. Hats off to Akihiro. At the beginning of last year, Usernetes was really a hodgepodge of bash scripts; it was really hard to use. So I engaged with Akihiro, and we released Generation 2 of Usernetes in September. And guess what? It is using containerization, which is really great, and it has these components that we'll go into in more detail. So what does it mean in practice? Well, it means that when you're building a virtual machine, you need to have cgroups version 2 enabled. I recommend Lima (Linux virtual machines) if you're prototyping this for the first time. It also means that you need to enable these kernel modules. Very generally speaking, br_netfilter is going to allow you to apply iptables rules to bridged traffic, and vxlan is going to allow you to connect VXLAN devices on different hosts to a standalone bridge. This is important because we actually have different physical nodes. Now, it's going to use rootless Docker. This isn't such a crazy idea anymore; many clusters have Podman these days. And what does it mean when you bring up these VMs? You're going to run a 'make up' command that has two contexts. Both of them build and start a base image that is using kind, Kubernetes in Docker, with CNI plugins. The two contexts are the control plane and the worker. The control plane installs Flannel and runs kubeadm init. This generates a join command, which is basically a token that you give to the workers, and then the workers can authenticate and join the cluster. And that's what they do; they're just like, I'm ready to serve.

All right, so we created this cluster, small and mighty, using oVirt and Ansible. It is small and mighty because each node has eight cores and about 30 GB of RAM. And I want to point out that we have seven nodes here because, generally speaking, we're going to have six that we run compute on and one that's an admin node, or control plane. Again, warning: not bare metal, you get the deal. All right, so what's in these VMs when we bring them up? We have a complete system install of Flux, Singularity on bare metal for reasons I'll tell you about in a little bit, LAMMPS installed on bare metal, and of course Usernetes ready to be brought up. So once I shell into these VMs, my Flux cluster is ready to go; I can do flux resource list and see all my nodes. And for Usernetes, again, that administrative node is also the control plane, so we technically have six nodes to work with, and we can still see them with kubectl get nodes. Here's what we're working with: Usernetes and Flux running side by side, the bare metal bros.

All right, bro, bro, what experiments do we want to run? All of them, bro! All right. So we first need to sanity check that what I said earlier about bare metal and LAMMPS and the simulations is actually true. We need to look at application performance between Flux and Usernetes. The way we're going to do that is by running a few things. We're first going to run LAMMPS on bare metal with Flux. We're then going to do the same thing but in a Singularity container, and I did this just to demonstrate that you don't lose anything by using containers, which is great. We're then going to run LAMMPS in Usernetes with the Flux Operator. And finally, we're going to repeat cases one and two, but with Usernetes running in the background, to see if there's any overhead from that.
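One practical aside before the results: the prerequisites listed above (cgroups v2, br_netfilter, vxlan, rootless Docker) are easy to get wrong, so here is a small, hedged preflight sketch you could run on each VM before 'make up'. The module names come from the talk; the check logic itself is an illustrative assumption, so consult the Usernetes documentation for the full requirements.

```python
# Hedged preflight sketch: verify cgroups v2, the kernel modules mentioned in
# the talk (br_netfilter, vxlan), and rootless Docker before `make up`.
# Illustrative only; the Usernetes docs are the source of truth.
import pathlib
import subprocess


def has_cgroup_v2() -> bool:
    # On a pure cgroup v2 host, cgroup.controllers exists at the cgroup root.
    return pathlib.Path("/sys/fs/cgroup/cgroup.controllers").exists()


def module_loaded(name: str) -> bool:
    # /proc/modules lists currently loaded kernel modules, one per line.
    loaded = pathlib.Path("/proc/modules").read_text()
    return any(line.split()[0] == name for line in loaded.splitlines())


def rootless_docker_ok() -> bool:
    # `docker info` should succeed for the unprivileged user.
    return subprocess.run(["docker", "info"], capture_output=True).returncode == 0


if __name__ == "__main__":
    checks = {
        "cgroup v2": has_cgroup_v2(),
        "br_netfilter module": module_loaded("br_netfilter"),
        "vxlan module": module_loaded("vxlan"),
        "rootless docker": rootless_docker_ok(),
    }
    for name, ok in checks.items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```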
And I need to pause for a second, because I know how incredibly cool this third case is. We have Flux on the outside; Flux is running Usernetes; within that we are launching the Flux Operator, which is bringing up another instance of Flux; and inside there is where LAMMPS is running. So folks, I know Thanksgiving is over, but this is the ultimate turducken. And we expect LAMMPS to be slower in Usernetes because, as we know, it makes MPI collective calls, and Usernetes is using something called slirp4netns, which requires additional processing of packets with a tap device. I have a great paper I can share if you're interested in learning more about that. So, drumroll, the results. As we expected — well, actually, maybe we didn't expect this — the Singularity container case is very comparable to actual bare metal. I was very surprised by this. Having Usernetes running in the background also does not add a lot of overhead. And, as we did expect, the case up there running in Usernetes is about twice as slow as running on bare metal. So what did we learn? We learned that for a setup like this, the network-sensitive stuff probably should be run on the HPC side. But I'll point out there's opportunity for improving this in Usernetes: if you have experience with networking, I'd love for you to go over to the GitHub right now — I'll just wait, pause the talk — and engage there to help work on this problem.

Now, the next thing we want to look at is distributed machine learning, specifically two cases: one distributed across six nodes, and the second on one node. In the distributed case the network is a variable; for one node, obviously, it is not. Drumroll, results: same thing, it's about twice as fast on bare metal, or twice as slow, I guess, in Usernetes. And interestingly, when you look at just a single node, these are really comparable. So there's no issue with running something on a single node in Usernetes in and of itself; it's really when you bring in the networking that it becomes a variable. So it's the network, right? Well, let's sanity check one more thing. Here's iperf. We did a bit of transfer from each node as a client to each node as a server. The bitrate in gigabits per second is between 10 and 30 for bare metal; for Usernetes it's barely detectable, close to zero, really, really terrible. We can see the same patterns for the amount transferred. So yes, it's the network; we're pretty confident that for this setup, it's the network.

All right, can we do the fun workflow now? We absolutely can. Guess what, I actually prototyped this kind of workflow because I was really excited about it. What we're going to do is launch a batch job with flux batch. This means a Flux instance that's owned by the running user; it's going to scope resources using hwloc, and within this batch job we can basically bring up and tear down all of Usernetes. We're going to take that workflow I mentioned before and map it onto our little cluster. We're going to run simulations with LAMMPS, randomly selecting the problem sizes, to predict wall time. We're then going to bring up a machine learning server, a special server I made using River a few years ago. And then we're going to do the test cases: we run LAMMPS again, but we leave out the actual wall time and ask our models what it is. We're going to do a thousand training samples and 250 testing samples.
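Since the workflow logic is the interesting part, here is a stripped-down, hedged sketch of the train-then-predict loop using the river online learning library. The run_lammps function is a stand-in for the real simulations, and the model is in-process rather than behind the machine learning server the talk describes; the sample counts match the talk, everything else is illustrative.

```python
# Stripped-down, hedged sketch of the workflow: learn to predict "wall time"
# from randomly chosen problem sizes, then hold out the wall time and compare
# predictions to the truth. run_lammps is a placeholder for the real
# simulation; the real setup talks to an ML server instead of an in-process model.
import random
import time

from river import linear_model, preprocessing


def run_lammps(x: int, y: int, z: int) -> float:
    """Placeholder: pretend to run a simulation and return wall time in seconds."""
    start = time.time()
    _ = sum(i * i for i in range(x * y * z * 200))  # stand-in for real work
    return time.time() - start


# Online pipeline: scale features, then fit a linear regression incrementally.
model = preprocessing.StandardScaler() | linear_model.LinearRegression()

# 1,000 training samples: run, measure wall time, learn one sample at a time.
for _ in range(1000):
    size = {"x": random.randint(1, 8), "y": random.randint(1, 8), "z": random.randint(1, 8)}
    model.learn_one(size, run_lammps(**size))

# 250 test samples: predict wall time without showing the model the truth.
errors = []
for _ in range(250):
    size = {"x": random.randint(1, 8), "y": random.randint(1, 8), "z": random.randint(1, 8)}
    errors.append(abs(model.predict_one(size) - run_lammps(**size)))

print(f"mean absolute error: {sum(errors) / len(errors):.4f} seconds")
```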
How did we do? I put no thought into these particular models, but I did three kinds of regression. The Bayesian one, sampling from a probability distribution, didn't do super well, but for the first two there's an actual pattern between the predicted and the actual time. So although I put no thought into this, I was really pleased with the result: the general prototype, this idea of having bare metal simulations running alongside a service, there is something here. We can do science this way, with actual, real scientific questions. And I'll point out that there are real heterogeneous workloads out in the wild that need this capability. Here's MuMMI, the Massively Parallel Multiscale Machine-Learned Modeling Infrastructure, which simulates biological systems, the interaction between proteins and the plasma membrane. I'll also point out that the name is based on the Moomins, the Finnish comic book series with really cute hippo-like characters, often with yellow spiky hair. Very awesome. So this is the perfect example of the bare metal bros, of coexistence: adopting technologies to make it possible to coexist, and continuing to improve upon them so that, for example with networking, this environment can get even better.

So what should you remember from this talk, if you take nothing else away? The first is to look out for opportunities for collaboration: look for that alignment of goals between spaces, that's an opportunity. The second is providing handles for your components: if you don't have the bandwidth to look for opportunities, add some Go bindings to your C++ project, because someone else could find you. The third is engagement: we need to show up at the table, go to working groups, conferences, places we haven't traditionally been, to engage and find these opportunities for collaboration. And possibly the most important is mindset. We've had this mindset of cloud versus HPC, that one has to win, for so long. We need to throw that away, get rid of the adversarial thinking, and have a more collaborative mindset. This is the vision that we have for the future of converged computing, and we hope that you'd like to join us. So thank you. That's how to reach me, my email and social networks, and here are some interesting links for Flux and the various projects. I think I will take some questions virtually now.

Okay, we can take a couple of questions; it seems like the Wi-Fi is stable enough to let Vanessa answer them. Do we have any questions? Okay, so Vanessa, we may have to repeat a question for you; we'll see how that works. Hi Vanessa, amazing talk, congrats. So I was wondering if your architecture can support sidecars, because one of the nightmares I had when I was trying to do something similar was that in order to get the sidecars running I had to spin up a second network stack, and that created a lot of overhead. No, no, just this one is on. Okay, did you get the question, Vanessa? No, I didn't hear the question at all. Neither did I. Yeah, maybe that's better. Okay, let's do it like this: you come up front and ask it here. Yeah, that's perfect. That'd be great, I can hear you great. Hi there. Hi. So I was wondering if your architecture can support sidecar containers, because, as I was saying, when I was trying to do something similar and I tried to create the sidecars, I had to create a second network stack within Singularity, so the network overhead was amazingly high. So, absolutely: the Flux Operator actually uses a sidecar, an init container, which is similar in concept, to add Flux on the fly as a view. What's going on with the networking in Kubernetes is sort of a different thing from that, so the short answer is yes.
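For context on that answer, here is a hedged sketch of the general init-container pattern being referred to: an init container populates a shared emptyDir volume (a "view" of extra software) that the application container then mounts. This shows the pattern in spirit only; it is not the Flux Operator's actual manifest, and all names and images are placeholders.

```python
# Hedged sketch of the init-container "view" pattern: an init container writes
# into a shared emptyDir volume, and the application container mounts the same
# volume. Pattern in spirit only, not the Flux Operator's actual manifest.
from kubernetes import client, config

config.load_kube_config()

shared = client.V1VolumeMount(name="software-view", mount_path="/mnt/view")

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="view-pattern-demo"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        volumes=[
            client.V1Volume(name="software-view", empty_dir=client.V1EmptyDirVolumeSource())
        ],
        init_containers=[
            client.V1Container(
                name="populate-view",
                image="busybox:latest",
                command=["sh", "-c", "echo 'pretend software install' > /mnt/view/README"],
                volume_mounts=[shared],
            )
        ],
        containers=[
            client.V1Container(
                name="app",
                image="busybox:latest",
                command=["sh", "-c", "cat /mnt/view/README && sleep 5"],
                volume_mounts=[shared],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```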
To add to that, though, I'm not sure that Singularity as the container runtime for Kubernetes would work. I have never tried that, but it doesn't sound like it would work. Yeah, it needs to be done. Yeah, exactly.

Hi Vanessa, thank you. Hi. It was the most fun presentation so far, thank you. So, when you were saying that the main difference in performance between VM and bare metal workloads was related to the network: was that also the case for distributed training, and if so, were you using InfiniBand or not? So, we did not have InfiniBand, and you make a really good point that this kind of setup would need to be tested with an actually great network. That is still a very big challenge even for cloud. For example, if you use AWS, you can bring the Elastic Fabric Adapter, which will give you great networking performance, but if you go to other clouds, and I don't have to name them specifically, you tend to only get really good networks when it comes to using TPUs or GPUs. The exception is Azure, which has a lot of really great HPC stuff built in. So absolutely, you could get that setup with InfiniBand.

Hi, thank you for your talk, I had a smile on my face the whole time; thank you for having such high energy at the end of the day. What was I going to say? Oh yeah: in my workloads I can probably reduce the network traffic by a very large margin if I can constrain certain jobs to specific nodes, because then large files don't have to be moved across the network for those jobs. Is that something you could keep in mind? So, if you remember the very quick machine learning experiment that we showed: when you're running something on one node and you're not using the network, there's no issue. So if you're just running something on one node in Usernetes, you won't have an issue, and to the degree that you can reduce anything that uses the network — moving data, MPI, etc. — you will get similar performance, at least in this small prototype experiment, as you would on bare metal. And I have to say "bare metal" like this because it wasn't really bare metal. Thanks.

One more question. Hey Vanessa, it's Danny. I'm going to dye my hair soon, so you won't recognize me again. I really liked your framing, actually. I thought it was going to be sort of adversarial, and then I realized what you were actually saying and I really appreciated it. However, regarding the adversarial framing: I have some experience with cloud tools and cloud environments being used as platforms for vendor lock-in. I think what you described, especially with converged computing, is a way to push back, so that scientific labs aren't indebted to corporations, and I think you gave a really useful example of one way to do that in your talk. So, again, I was very impressed by how you explained that. I would like to know, in a more general sense, how can labs, and potentially RSEs, make use of cloud tools without getting locked in or becoming beholden to a corporate environment? And again, I think you effectively did that in this talk, so I'm just looking for a more general thought about it. You're totally correct that vendor lock-in is an issue: you tend to see many sort of niche APIs in the different clouds, and if you build your entire thing around them, you do face that as an issue.
But the great thing about Kubernetes is that it is this open source project that is available across clouds. There are subtle differences, but if you make a workload that can run on Kubernetes, you're going to have an easier time moving it between clouds. And, speaking for my lab, we work on Flux Framework, and one of our goals with Flux is to make things portable, not just between clouds, but between cloud and HPC. That's also why something like Usernetes, running actual Kubernetes on bare metal alongside HPC, is so important: all of a sudden you have the same workload and it runs in all the places. That is the vision. We want to make sure that the scientific workloads we're running today can run everywhere, not just in one niche specific cloud, not just in one niche specific center. Just convergence, TL;DR. That is very exciting and I really appreciate that response, thank you so much. Okay, that's all we have time for. This was great, Vanessa, I hope you agree. Yeah, it was really fun. If anyone has further questions and stuff, please reach out to me, I love chatting. It was a pleasure chatting with you, and I hope you have a great rest of the event. Thank you. And the best way to reach out to Vanessa is via HPC social, so don't forget to grab a sticker as you walk out. Please consider dropping a small donation in the box as well to help cover the costs. And if you're leaving, please check if you see any trash around and take it with you — bottles, anything. Anything you clean up, we don't have to clean up. Thanks a lot, Vanessa, this was great. Bye.