[00:00.000 --> 00:10.540] Next speaker is John Garbutt from StackHPC, who's going to talk about self-service Kubernetes [00:10.540 --> 00:13.260] with RDMA on OpenStack. [00:13.260 --> 00:14.260] Thank you. [00:14.260 --> 00:15.260] Hello, everyone. [00:15.260 --> 00:18.140] Yeah, I pressed the button. [00:18.140 --> 00:19.140] Excellent. [00:19.140 --> 00:20.140] It's green. [00:20.140 --> 00:21.140] Hello, everyone. [00:21.140 --> 00:22.140] I'm John Garbutt. [00:22.140 --> 00:29.860] I'm here to talk to you about OpenStack, RDMA and Kubernetes: are they oil and water that won't mix, [00:29.860 --> 00:32.740] or are they bread, oil and vinegar? [00:32.740 --> 00:37.420] Hopefully I'll convince you it's something nice. [00:37.420 --> 00:41.180] So, to start with some thank-yous to my sponsors. [00:41.180 --> 00:43.180] I work at StackHPC. [00:43.180 --> 00:47.780] We're about twenty-something people now. [00:47.780 --> 00:52.900] We've got people across the UK and across Europe. [00:52.900 --> 00:56.220] I'm based out of Cambridge, but the head office and a lot of people are around Bristol, with people [00:56.220 --> 00:59.980] in Poland and people in France as well. [00:59.980 --> 01:07.120] And we work on helping people create OpenStack clouds, training them up on how to look after [01:07.120 --> 01:12.820] them, and supporting them through that journey and everything that's happening there. [01:12.820 --> 01:18.220] For this particular topic today, I want to say a big thank you to all of these organizations. [01:18.220 --> 01:20.580] These are all in the UK. [01:20.580 --> 01:21.580] Lastly, JASMIN. [01:21.580 --> 01:28.580] So I'm going to talk today about how we package up these solutions and stamp them [01:28.580 --> 01:33.060] out for people as reusable pieces, and this is a project that came out of the JASMIN [01:33.060 --> 01:34.060] facility. [01:34.060 --> 01:39.860] And that got taken on by IRIS, which is an STFC community cloud project. [01:39.860 --> 01:45.620] So they're trying to find ways in which more STFC-funded activities in the UK can share [01:45.620 --> 01:47.060] the same sets of infrastructure. [01:47.060 --> 01:51.580] How do we get one pool of infrastructure and share that between all of these different [01:51.580 --> 01:53.580] research use cases? [01:53.580 --> 01:59.500] And in particular, there are lots of organizations we've been working on getting feedback from. [01:59.500 --> 02:04.620] So we've been working a lot with the SKA community in the UK, particularly the SLC community [02:04.620 --> 02:05.620] at the moment. [02:05.620 --> 02:10.980] And they've been giving us great feedback on some early versions of all of this and [02:10.980 --> 02:11.980] how to improve things. [02:11.980 --> 02:15.940] And that's actually been partly funded by the DiRAC project as well, which is a [02:15.940 --> 02:23.380] group of HPC systems. [02:23.380 --> 02:29.060] Also note the small i, not the capital I, in DiRAC, just to confuse everything. [02:29.060 --> 02:32.820] If you look for the small-i DiRAC, that's the group of HPC centers, as opposed to [02:32.820 --> 02:36.700] the job submission system. [02:36.700 --> 02:40.740] And we've been working very closely with the research computing services at the University [02:40.740 --> 02:43.380] of Cambridge, tying this together. [02:43.380 --> 02:48.100] They're one of the IRIS sites and one of the DiRAC sites, and we're starting to reuse the things [02:48.100 --> 02:50.100] coming out of JASMIN.
[02:50.100 --> 02:55.100] Anyway, a big thank you to all those folks. [02:55.100 --> 03:01.700] So I want to start with: why on earth would you use OpenStack and Kubernetes and not just [03:01.700 --> 03:06.460] have one big batch scheduler? [03:06.460 --> 03:10.460] And really it's about getting the most value out of the infrastructure investment you've [03:10.460 --> 03:11.460] made. [03:11.460 --> 03:15.860] And today it's also worth saying that part of what I mean by that is that [03:15.860 --> 03:20.900] the investment in your infrastructure is also an investment in carbon cost. [03:20.900 --> 03:25.180] How do you get the best out of that investment in carbon to manufacture these machines and [03:25.180 --> 03:27.460] run these machines? [03:27.460 --> 03:28.660] And what do I mean by value? [03:28.660 --> 03:30.820] Well, that's different things to different people. [03:30.820 --> 03:35.900] I mean, how do we reduce time to science, how do we get more science out of that particular [03:35.900 --> 03:40.580] investment that a community has made? [03:40.580 --> 03:46.460] So firstly it's a bit about sharing diverse infrastructure. [03:46.460 --> 03:48.780] Hopefully people aren't hungry, apologies. [03:48.780 --> 03:56.260] I've spent far too much time on Unsplash, so thank you to Unsplash. [03:56.260 --> 04:01.540] So there's increasing diversity, as in the different flavours on the pizza here, in lots of the [04:01.540 --> 04:03.180] user requirements. [04:03.180 --> 04:07.860] So in terms of the IRIS community, they're currently working a lot more with [04:07.860 --> 04:13.060] large international collaborations, and often those users come with a system that they want [04:13.060 --> 04:16.460] to run on your infrastructure, regardless of everything else that's happening. [04:16.460 --> 04:20.340] And so one of the problems that's been happening is that you sort of silo your infrastructure into: [04:20.340 --> 04:25.620] well, this was bought for purpose A, this was bought for purpose B, but actually those [04:25.620 --> 04:27.900] infrastructures are getting more diverse. [04:27.900 --> 04:32.900] There are only so many GPUs any one person can afford in a particular institution, and everyone [04:32.900 --> 04:34.020] wants to use them. [04:34.020 --> 04:35.060] How do we share that out? [04:35.100 --> 04:39.780] How do we share out the accelerators and all the special bits of kit between these different [04:39.780 --> 04:45.540] use cases, when day to day it might be different people wanting to use those bits of infrastructure? [04:45.540 --> 04:49.060] That's the "how do we slice it up?" question. [04:49.060 --> 04:55.020] And also, one physical server is getting bigger and bigger in terms of what it takes to consume it, [04:55.020 --> 05:00.220] so particularly when you're doing test and development, giving people one whole server can [05:00.220 --> 05:01.940] be a problem. [05:01.940 --> 05:07.140] The other thing, and I'm speaking as a developer here before I bash developers, is that we love breaking [05:07.140 --> 05:08.140] things. [05:08.140 --> 05:12.180] So if you give people access to the kernel and they go crazy and crash the [05:12.180 --> 05:18.260] kernel, if it's just your own little kernel, then it's only you that you've crashed. [05:18.260 --> 05:20.620] That's a bit of an extreme example, to be fair. [05:20.620 --> 05:23.900] I don't really mean crashing the kernel, I more mean crashing the things that you load [05:23.900 --> 05:27.900] into the kernel, to be more precise.
[05:27.900 --> 05:30.460] Anyway, how do we separate this up? [05:30.460 --> 05:35.060] And actually probably a better analogy than pizza is a reconfigurable conference [05:35.060 --> 05:36.220] room. [05:36.220 --> 05:39.820] So if you plan ahead, you can make this kind of change. [05:39.820 --> 05:46.220] So sometimes you want to use all of the room for a really big meeting, like this one. [05:46.220 --> 05:49.500] Sometimes you want to divide it up, and when you divide it up, you want a certain [05:49.500 --> 05:55.220] amount of isolation, and if you're not careful you can get the noisy neighbor problem in [05:55.220 --> 05:57.220] these setups. [05:57.220 --> 06:05.020] So you have to be careful about how you're actually doing that dividing. [06:05.020 --> 06:10.260] And so one of the things that's also changed most recently is how we get these reusable [06:10.260 --> 06:11.820] bits of infrastructure. [06:11.820 --> 06:16.420] So, as I said, we want reusable platforms on top of the infrastructure. [06:16.420 --> 06:19.660] One of the things I said about the IRIS project is that it's working a lot with international [06:19.660 --> 06:22.780] communities coming with a thing to run. [06:22.780 --> 06:27.980] Very often these days that thing to run is packaged in Kubernetes. [06:27.980 --> 06:32.940] Sometimes people are developing on Kubernetes on their laptops and they need a bigger Kubernetes, [06:32.940 --> 06:35.700] but this is certainly becoming a thing now. [06:35.700 --> 06:40.300] People just say, you know, this is how I'm wanting to deploy, so how do I carve [06:40.300 --> 06:45.340] out the infrastructure and have Kubernetes on top of it to do what I need [06:45.340 --> 06:46.340] to do? [06:46.340 --> 06:52.620] And actually it's been very helpful in giving us a higher level of abstraction [06:52.620 --> 06:58.060] to work with, to package up web applications and interactive [06:58.060 --> 07:02.940] applications and a whole manner of things. [07:02.940 --> 07:12.740] Okay, so the next piece in the topic is: why RDMA networking, or why random access... remote [07:12.740 --> 07:13.740] direct memory access. [07:13.740 --> 07:17.220] So that I can remember that, I put it on the slide. [07:17.220 --> 07:21.500] I thought, to try and prove my point, I'd show a pretty graph. [07:21.500 --> 07:23.020] This is OpenFOAM. [07:23.020 --> 07:28.860] At the bottom here there's a link to the tool that we use to actually run these benchmarks [07:28.860 --> 07:32.740] and to make them nice and repeatable. [07:32.740 --> 07:37.820] Essentially you can describe in a Kubernetes CRD the kind of benchmark you want to run, [07:37.820 --> 07:44.300] and then it basically submits a job to Volcano, monitors the output and just tells you what [07:44.300 --> 07:45.300] the result of that was. [07:45.300 --> 07:50.900] It's just a way of making it nice and quickly reproducible. [07:50.900 --> 07:56.860] So if you look at this graph, it's showing basically wall-clock time for the simulation. [07:56.860 --> 08:02.180] And on these lines we've got lots of different networking technologies that were being tested [08:02.180 --> 08:08.140] out, and unsurprisingly the ones that were performing the best have all got the [08:08.140 --> 08:13.820] lowest wall-clock time, so the best result in this particular benchmark.
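(On the benchmark tooling mentioned above: here is a rough sketch of the kind of MPI job such a benchmark CRD might expand into before handing it to Volcano for gang scheduling. The layout follows Volcano's documented MPI pattern; the job name, image, replica counts and solver command are placeholders rather than anything the actual tool generates.)

```yaml
# Hypothetical sketch: an MPI benchmark job gang-scheduled by Volcano.
# Image, name and command are placeholders; a benchmark CRD would fill these in.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: openfoam-bench
spec:
  minAvailable: 3              # launcher plus both workers must be scheduled together
  schedulerName: volcano
  plugins:
    ssh: []                    # inject SSH keys so mpirun can reach the workers
    svc: []                    # stable hostnames and a hostfile for the worker pods
  tasks:
    - name: mpimaster
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: launcher
              image: example.org/openfoam-bench:latest
              command: ["mpirun", "--allow-run-as-root",
                        "--hostfile", "/etc/volcano/mpiworker.host",
                        "simpleFoam", "-parallel"]
    - name: mpiworker
      replicas: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: example.org/openfoam-bench:latest
              command: ["/usr/sbin/sshd", "-D"]
```

The benchmark CRD essentially hides this behind a few fields, then watches the job and reports the result.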
[08:13.820 --> 08:20.660] Going back to the graph: as you can see, this was probably an interesting configuration in the sense that, as you were [08:20.660 --> 08:24.540] scaling out the compute, there was actually no benefit at all in terms of the simulation [08:24.540 --> 08:25.540] time. [08:25.540 --> 08:34.060] Actually, interestingly, because of this slightly wackadoodle configuration, or because the job was [08:34.060 --> 08:39.460] too small essentially, you can actually see in the TCP ones above that they gradually [08:39.460 --> 08:46.020] get worse as there's more cross-communication, as you would expect with MPI underneath here. [08:46.020 --> 08:53.140] So if we dive down into MPI, on the left-hand side we've got the latencies, and these bottom [08:53.140 --> 08:59.660] latencies, for people at the back of the room: there are two at about five microseconds [08:59.660 --> 09:02.460] and one that's about half of that. [09:02.460 --> 09:06.020] These are interesting, these are the RDMA ones. [09:06.020 --> 09:13.260] Actually, I'm saying RDMA here, but these are all RoCE, over Ethernet, as you [09:13.260 --> 09:17.740] probably guessed, because I just said what the latencies were, if you're interested in [09:17.740 --> 09:19.740] that kind of thing. [09:19.740 --> 09:30.540] So there's no such thing as a free coffee, unless you're at FOSDEM I guess, but let's [09:30.540 --> 09:35.100] just compare very briefly those three technologies. [09:35.100 --> 09:38.900] If we have a look at the bandwidth, there's something interesting happening here. [09:38.900 --> 09:42.380] It would be slightly more interesting if we'd actually had the hardware for long enough [09:42.500 --> 09:46.820] to run the rest of the points, but you can see that the one with the lowest latency actually [09:46.820 --> 09:52.740] caps out at about 100 gigabits a second, and the ones with a slightly higher latency, or [09:52.740 --> 10:00.180] double if you're being mean, actually go all the way up to 200 gigabits a second, and [10:00.180 --> 10:04.460] there's a difference in the way in which that's been wired up, which I'll go [10:04.460 --> 10:09.060] into in a bit more detail later, but essentially one of them can use the whole bond, and one [10:09.060 --> 10:11.420] of them can only use one side of the bond. [10:11.460 --> 10:15.580] So these were on servers with bonded 100 gig Ethernet. [10:15.580 --> 10:20.460] If you pay a latency penalty, you can use both sides of the bond in an interesting way. [10:20.460 --> 10:26.460] If you want the ultimate lowest latency, you kind of have to dedicate and just use one [10:26.460 --> 10:29.460] side of the bond. [10:29.460 --> 10:35.180] Anyway, so why does this make a big difference to these kinds of workloads? [10:35.180 --> 10:40.380] I'm referencing a talk here that was at KubeCon, "Five Ways with a CNI". [10:40.380 --> 10:47.100] If you look at the FOSDEM session information for this talk, one of the links on there is [10:47.100 --> 10:50.980] to a blog that we wrote about this kind of thing, and there's a video from KubeCon you [10:50.980 --> 10:57.740] can watch for more detail; that particular set of benchmarks compares all these different [10:57.740 --> 11:03.140] ways of wiring up the networks. [11:03.140 --> 11:06.780] So that all sounded a bit complicated, right? [11:06.780 --> 11:12.260] How do we actually stamp this out in a kind of useful way for users and get this all tied [11:12.260 --> 11:15.260] together?
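(Before moving on, a brief aside on the bond point above: whether flows can really use both links comes down to the bonding and hashing configuration on the host. A minimal sketch in netplan syntax, with placeholder NIC names and regardless of whatever tooling actually lays the host networking down, might look like this; the switch at the other end also has to support a matching hash mode.)

```yaml
# Hypothetical host bond configuration (netplan syntax); NIC names are placeholders.
network:
  version: 2
  ethernets:
    enp1s0f0: {}
    enp1s0f1: {}
  bonds:
    bond0:
      interfaces: [enp1s0f0, enp1s0f1]
      parameters:
        mode: 802.3ad                    # LACP
        lacp-rate: fast
        transmit-hash-policy: layer3+4   # hash per flow, so flows spread across both links
```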
[11:15.260 --> 11:20.140] So how do we manage that operational complexity? [11:20.140 --> 11:27.380] So the first side of this, in terms of deploying the OpenStack layer and configuring all [11:27.380 --> 11:33.820] of that: we've got tools from the OpenStack community, from the Kolla community in particular, [11:33.860 --> 11:39.020] Kayobe and Kolla Ansible, and we use those Ansible playbooks so that, repeatedly, [11:39.020 --> 11:42.420] once you've got a working configuration, you make sure you get that same thing every time. [11:42.420 --> 11:48.500] It involves ensuring you can re-image the machines easily and making sure that you apply [11:48.500 --> 11:53.820] the Ansible on there and get the same thing each time, so it's all packaged up, and [11:53.820 --> 11:59.260] that is all open for people to reuse. [11:59.260 --> 12:02.860] And then the next stage is that users need to actually consume this infrastructure. [12:02.900 --> 12:09.860] So if we give people OpenStack directly, they can get very confused, because the people [12:09.860 --> 12:16.420] that are trying to just create a platform are typically not experts in using cloud infrastructure. [12:16.420 --> 12:18.500] So how do we make that easier? [12:18.500 --> 12:22.060] So I want to talk about Azimuth. [12:22.060 --> 12:26.900] This is the project that I mentioned at the beginning, coming from the JASMIN team, and [12:26.900 --> 12:32.420] the idea here is for the people creating platforms, so for the platform creators, people who want [12:32.460 --> 12:37.220] to create a JupyterHub or a DaskHub or a Slurm cluster that's isolated and dedicated [12:37.220 --> 12:38.220] to their own needs, [12:38.220 --> 12:44.060] maybe for a development use case or otherwise, or who want to create a Kubernetes cluster. [12:44.060 --> 12:50.500] How do we just package up those good practices and make that really easy to deploy? We're calling [12:50.500 --> 12:53.580] this platform-as-a-service. [12:53.580 --> 12:58.420] If you've seen me talk about this before, one of the changes here is that you get all [12:58.420 --> 13:05.260] of the platforms in one view now, and you can log in using your OpenStack credentials. [13:05.260 --> 13:09.420] So there's the cloud operator, and then the platform operator logs into Azimuth and creates [13:09.420 --> 13:14.900] the platform, and then on top of the platform you can choose which users can log into that, [13:14.900 --> 13:23.260] just to make all of that much easier to do. [13:23.260 --> 13:28.020] So I'll quickly go through the types of things that are going on here and the different types [13:28.020 --> 13:30.780] of platforms. [13:30.780 --> 13:34.740] So firstly, there are Ansible-based platforms. [13:34.740 --> 13:40.500] So things like "give me a bigger laptop", which is a particular use case, so give me a Linux [13:40.500 --> 13:46.380] workstation that I can just Guacamole into, or give me a Slurm cluster. [13:46.380 --> 13:49.180] What we do for those, as they're not Kubernetes-based, [13:49.180 --> 13:56.380] is use Terraform to stamp out virtual machines, with Ansible, basically Ansible [13:56.380 --> 14:03.260] running Terraform to stamp out the machines and then doing any final configuration that might be required. [14:03.260 --> 14:06.460] So when you click the button, all of that happens in the background, and it sets up [14:06.460 --> 14:10.500] the infrastructure and you can get straight in.
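(As a rough sketch of that pattern, not the actual playbooks, just the shape of Ansible driving Terraform and then configuring the result: the module names are real Ansible collections, while the paths, variables, the Terraform output name and the role are placeholders.)

```yaml
# Hypothetical playbook shape: Terraform stamps out the VMs, Ansible finishes the config.
- name: Provision workstation infrastructure
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Apply the Terraform that creates the OpenStack servers
      community.general.terraform:
        project_path: ./terraform        # placeholder path to the Terraform config
        state: present
        variables:
          cluster_name: my-workstation   # placeholder values
          flavor: general.v1.small
      register: tf

    - name: Add the new servers to the in-memory inventory
      ansible.builtin.add_host:
        name: "{{ item }}"
        groups: workstation
      loop: "{{ tf.outputs.server_ips.value }}"   # assumes a Terraform output named server_ips

- name: Final configuration of the new machines
  hosts: workstation
  become: true
  roles:
    - workstation   # placeholder role, e.g. desktop, Guacamole, monitoring
```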
[14:10.500 --> 14:15.740] The other type is "give me a Kubernetes cluster". I'll go into this in a bit more detail in a sec, [14:15.740 --> 14:21.300] but you choose your Kubernetes cluster options, set that up, and it stamps that out. [14:21.300 --> 14:27.620] And the third type, which is relatively new now, is "well, I just want a JupyterHub or [14:27.620 --> 14:34.420] a DaskHub", and so for those kinds of situations we're deploying those onto the Kubernetes cluster, [14:34.420 --> 14:36.740] so you can go through that. [14:36.740 --> 14:38.620] So let's go into a bit more detail. [14:38.620 --> 14:43.860] This is more just a bit of an eye chart, particularly because it's not rendering at all. [14:43.860 --> 14:47.140] The idea is it just asks some basic questions about creating a Kubernetes cluster: what [14:47.140 --> 14:51.260] size nodes you want, what its name is. [14:51.260 --> 14:57.020] If you're creating your Kubernetes application, and you've pressed go on the Kubernetes application, [14:57.020 --> 15:02.540] you give it a name and the basic constraints for the notebooks, it's sort of pre-configured, [15:02.540 --> 15:06.700] and you tell it which Kubernetes cluster to put it on, or create one if you haven't got [15:06.700 --> 15:11.020] one yet. [15:11.020 --> 15:16.020] And then finally, when you've stamped out all of these bits of infrastructure, you can [15:16.020 --> 15:22.540] see there's a nice single sign-on to go and dig in. [15:22.540 --> 15:28.060] So if you've got a DaskHub, you can click on the link to open your notebook, [15:28.060 --> 15:31.380] and it gets you straight in. [15:31.380 --> 15:35.460] One of the issues we've got at the moment is that the cost of IPv4 addresses, or [15:35.460 --> 15:38.740] rather the shortage of IPv4 addresses, is a big deal. [15:38.740 --> 15:43.460] So we're actually using a tunneling proxy here called Zenith. [15:44.020 --> 15:50.140] So essentially when we create the infrastructure, there's an SSH session poking out, doing a [15:50.140 --> 15:57.060] port forward essentially, out to the proxy, and the proxy does all the [15:57.060 --> 16:01.660] authentication and authorization and then punches that through. [16:01.660 --> 16:08.420] So essentially it means you've got a VM inside your private network, [16:08.420 --> 16:13.420] and it goes out through the NAT, not consuming floating IPs for each of these bits [16:13.420 --> 16:16.140] of infrastructure that you're stamping out. [16:16.140 --> 16:22.820] And there's lots of... I'm not going to go into too much detail on all of these things. [16:22.820 --> 16:27.220] If you create a Kubernetes cluster, it's easy to get the kubeconfig out. [16:27.220 --> 16:34.500] It's got monitoring included, and Slurm, similarly, comes with monitoring and Open OnDemand dashboards. [16:34.500 --> 16:39.220] So in this case you can get in and out through Open OnDemand, although this one does require [16:39.220 --> 16:43.060] a public IP so that you can do SSH. [16:43.060 --> 16:48.740] I mentioned the bigger desktop: if you just want a VM, you can get into it without worrying [16:48.740 --> 16:51.700] about SSH, without having to configure all that. [16:51.700 --> 16:56.740] You can go in through Guacamole and get a web terminal and so on. [16:56.740 --> 17:03.420] Again, you can stamp out all of these without consuming a floating IP.
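(Going back to the notebook constraints for a moment: under the hood that sort of thing boils down to a handful of Helm values. This is a hedged sketch against the community dask/daskhub chart, with invented numbers, and not necessarily what Azimuth actually templates out.)

```yaml
# Hypothetical values for the dask/daskhub Helm chart; the limits here are just examples.
jupyterhub:
  singleuser:
    memory:
      limit: 8G        # per-notebook memory cap
      guarantee: 4G
    cpu:
      limit: 4
      guarantee: 2
    defaultUrl: /lab   # land users in JupyterLab rather than the classic notebook
```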
[17:03.420 --> 17:07.580] Another mode, which is a bit like BinderHub but just inside a single VM, is that you just specify [17:07.580 --> 17:09.540] your repo for repo2docker. [17:09.540 --> 17:10.540] Same kind of idea. [17:10.540 --> 17:14.540] It spins up the Jupyter Notebook, punches it out with Zenith, so it's all nice and simple [17:14.540 --> 17:16.340] to just get that up and running. [17:16.340 --> 17:23.700] Okay, so let's do a little bit of a shortish technical dive into how you actually get [17:23.700 --> 17:29.060] RDMA in LOKI. "What the heck is LOKI", you may have said. [17:29.740 --> 17:34.860] If you've been in some of the OpenInfra talks, Thierry described this quite well. [17:34.860 --> 17:41.980] This is the idea of Linux, OpenStack and Kubernetes giving you dynamic infrastructure. [17:41.980 --> 17:43.660] How do we get RDMA into this stack? [17:43.660 --> 17:46.140] There are three main steps. [17:46.140 --> 17:51.220] First of all, you do need RDMA in the OpenStack servers that you're creating. [17:51.220 --> 17:55.980] The second step is, if you want Kubernetes, you need the Kubernetes clusters on those OpenStack [17:55.980 --> 17:56.980] servers. [17:57.300 --> 18:03.220] The third step is you need RDMA inside the Kubernetes pods executing within the Kubernetes [18:03.220 --> 18:04.220] clusters. [18:04.220 --> 18:06.780] So let's just drill down into each of those. [18:06.780 --> 18:09.620] So how do we do RDMA inside the OpenStack servers? [18:09.620 --> 18:12.620] Well, there are two main routes here. [18:12.620 --> 18:22.020] The first route is, if it's a bare metal server, you've got the NIC there, and RDMA is generally [18:22.020 --> 18:24.780] available in the way it's normally available. [18:25.580 --> 18:27.780] There's not a lot special to do there. [18:27.780 --> 18:28.940] I should stop there for a moment. [18:28.940 --> 18:32.900] What I've said is you're using the standard OpenStack APIs and all the Terraform tooling [18:32.900 --> 18:34.740] and you're stamping out bare metal machines. [18:34.740 --> 18:36.060] That's totally possible. [18:36.060 --> 18:43.060] When you select the flavor dropdown, it might be "give me a box with eight A100s in it. [18:43.060 --> 18:44.380] I want the whole thing." [18:44.380 --> 18:46.380] That's perfectly possible. [18:46.380 --> 18:50.900] So, I referenced Cambridge as helping us out with this. [18:51.020 --> 18:56.620] Cambridge's HPC clusters are actually deployed on OpenStack using the bare metal orchestration. [18:56.620 --> 19:03.060] So it doesn't get in the way of anything in terms of RDMA or InfiniBand or whatever. [19:03.060 --> 19:05.980] You get the bare metal machine. [19:05.980 --> 19:09.300] On the VM side, it's a little bit more complicated. [19:09.300 --> 19:13.980] Essentially, the easiest way to get RDMA working in there is that we pass in an actual NIC [19:13.980 --> 19:17.300] using PCI passthrough, i.e. SR-IOV. [19:17.300 --> 19:23.340] So the VM itself has to have drivers appropriate for the NIC that you've passed through. [19:23.340 --> 19:27.740] Now there's a whole bunch of different strategies for doing that, but I wanted to quickly go [19:27.740 --> 19:35.180] through this one, which works specifically on some Mellanox cards, and there are other [19:35.180 --> 19:36.700] ways of doing this. [19:36.700 --> 19:41.100] Essentially you do OVS offload onto your virtual function. [19:41.100 --> 19:49.660] So if you do SR-IOV into the VM, that virtual function can actually get attached into OVS.
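(As an aside on the tenant view of that: requesting this kind of NIC is essentially just a property on the Neutron port. The talk's tooling is Terraform and the plain OpenStack APIs, but the same idea expressed as a hedged Heat-style sketch looks like this; the network, image and flavor names are placeholders, and whether you want a plain "direct" VNIC or a switchdev-offloaded binding depends on how the cloud is set up.)

```yaml
# Hypothetical Heat template fragment: a server with an SR-IOV virtual function attached.
heat_template_version: 2018-08-31
resources:
  rdma_port:
    type: OS::Neutron::Port
    properties:
      network: rdma-net                   # placeholder high-speed network
      "binding:vnic_type": direct         # pass a VF into the VM instead of a virtio NIC
      port_security_enabled: false        # relates to the extra MACs macvlan generates later
  rdma_server:
    type: OS::Nova::Server
    properties:
      name: rdma-worker-0                 # placeholder
      image: ubuntu-22.04-mlnx            # placeholder image carrying the right NIC drivers
      flavor: vm.rdma.large               # placeholder flavor
      networks:
        - port: { get_resource: rdma_port }
```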
[19:49.660 --> 19:53.260] Now, attaching a fast virtual function into OVS sounds insane, because that's a really slow path and you've just put a nice fast thing [19:53.260 --> 19:54.900] into a slow path. [19:54.900 --> 20:02.140] What happens is that OVS gets told to look for hardware-offloaded flows. [20:02.140 --> 20:06.420] So when you actually start getting connections going into your different machines, it notices [20:06.420 --> 20:15.540] the MAC and IP address pairs, those flows in OVS get put into the hardware, and then [20:15.540 --> 20:17.620] it goes onto a fast path. [20:17.620 --> 20:24.940] The other part of this is that you connect OVS directly to your bond on the host, [20:24.940 --> 20:29.020] and the VFs are actually getting connected to the bond. [20:29.020 --> 20:32.260] So in that earlier graph where I was showing 200 gigabits a second and basically getting [20:32.260 --> 20:38.100] line rate, that's using this setup, where essentially your VM with its virtual function [20:38.100 --> 20:42.580] is going through the bond rather than through one of the individual interfaces. [20:42.580 --> 20:48.380] And this is actually quite a nice setup in terms of wiring. [20:48.380 --> 20:53.180] So if you've got a server that's got dual 100 gig Ethernet going in, or dual 25 gig [20:53.180 --> 20:57.900] Ethernet, you don't have to dedicate one of those ports to SR-IOV. [20:57.900 --> 21:01.620] You have the host bond on there, and you can connect the virtual functions into the host [21:01.620 --> 21:05.020] bond. [21:05.020 --> 21:10.820] Okay, so the next bit: create Kubernetes. [21:10.820 --> 21:14.220] I'm not going to go into that in too much detail. [21:14.220 --> 21:17.620] Essentially we're using Cluster API. [21:17.620 --> 21:24.300] I really like its logo. Basically you create a management cluster, [21:24.300 --> 21:30.500] in CRDs you describe what you want your HA cluster, or whatever other cluster, to be, [21:30.500 --> 21:33.580] and it stamps that out using an operator. [21:33.580 --> 21:39.300] This has proved to be a really quite stable and reliable way of creating Kubernetes. [21:39.300 --> 21:46.780] One part of this is that we're actually hoping to, well, while I'm in this room I'm [21:46.780 --> 21:52.300] trying to fix the unit tests on it, but we're developing a driver for OpenStack Magnum [21:52.300 --> 21:56.500] to actually consume Cluster API and just stamp these out. [21:56.500 --> 22:03.980] To make this repeatable, it's all been packaged up in Helm charts, which are here. [22:03.980 --> 22:10.660] So now we've got OpenStack machines that have got RDMA in them; we can do that. [22:10.660 --> 22:15.100] We've set up a Kubernetes cluster that's using those OpenStack machines that have the virtual [22:15.100 --> 22:18.100] function in them, doing RDMA at line rate. [22:18.100 --> 22:24.980] Now how on earth do we get the Kubernetes pods to actually make use of RDMA? [22:24.980 --> 22:30.860] Now if this were a bare metal machine, there are actually quite a lot of standard patterns. [22:30.860 --> 22:34.060] It seems to be quite well documented how to actually pass virtual functions into the [22:34.060 --> 22:35.060] pod. [22:35.060 --> 22:40.300] If we're inside a VM, we've already done the PF-to-VF translation, so you can't go again. [22:40.300 --> 22:46.460] You can't have a VF of a VF yet, although vDPA and other things might change this. [22:46.460 --> 22:52.020] So what we're actually doing is we're using Multus and something called the macvlan CNI.
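(To give a flavour of what that looks like, here is a minimal sketch. It assumes the VF shows up inside the VM as ens4, an arbitrary subnet, and the Whereabouts IPAM plugin; it is not the exact manifest used in these deployments.)

```yaml
# Hypothetical NetworkAttachmentDefinition: a macvlan network on top of the VM's VF interface.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens4",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "10.10.0.0/24"
      }
    }
---
# Hypothetical pod: the annotation asks Multus for the second interface,
# and IPC_LOCK lets the RDMA libraries pin memory.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
    - name: worker
      image: example.org/mpi-roce:latest   # placeholder image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
```

Ordinary pod traffic still goes over the default CNI interface; only the traffic you point at the second interface takes the RDMA path.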
[22:52.020 --> 22:56.780] So essentially, when you create your Kubernetes pod, you give it two interfaces: your regular [22:56.780 --> 23:04.340] CNI interface, which has all the usual smarts, and an additional MAC and IP address [23:04.340 --> 23:08.020] pair on your virtual function for the VM. [23:08.020 --> 23:12.860] Now at the moment you have to turn off port security to ensure that those extra MACs that [23:12.860 --> 23:18.740] are auto-generated inside Kubernetes punch out correctly and aren't restricted at the virtual [23:18.740 --> 23:19.740] function. [23:19.740 --> 23:23.780] There's a plan to try and orchestrate that, so you can use allowed address pairs to explicitly [23:23.780 --> 23:25.780] decide which ones are allowed. [23:25.780 --> 23:31.100] But that's basically it: you use Multus to say "give me two network connections", and [23:31.100 --> 23:37.660] you use macvlan to get that connection onto your RDMA interface. [23:37.660 --> 23:42.260] And there's also some permission stuff, which is actually quite a simple decoration on the [23:42.260 --> 23:43.260] pod. [23:43.260 --> 23:51.220] But essentially it's extra pod YAML to opt in and get this all wired together. [23:51.220 --> 23:58.540] Okay, so it'd be really great, if people have these problems and find this interesting, to [23:58.540 --> 23:59.540] get involved. [23:59.540 --> 24:06.220] There's a whole load of links, but yeah, thank you very much. [24:06.220 --> 24:21.980] And we've got time for half a question. [24:21.980 --> 24:31.980] Yeah, you mentioned, I thought you mentioned, that you're doing the bond on the network [24:31.980 --> 24:32.980] interface, [24:32.980 --> 24:35.540] and you're getting the full bandwidth of the bond. [24:35.540 --> 24:40.340] Whenever I do LACP bonding, any particular connection only gets half the interface. [24:40.340 --> 24:42.580] So I'm just wondering how you're doing that. [24:42.580 --> 24:44.060] It depends on your bonding mode. [24:44.060 --> 24:47.300] But yeah, with LACP bonding, it only seems to give me half. [24:47.300 --> 24:51.580] Well, no, so there's a hashing mode on your bond. [24:51.580 --> 24:57.220] So what you need to make sure is that you do something like L3 plus L4 hashing, so that, [24:57.220 --> 25:02.460] even from a single client, it depends, but basically each of your traffic flows gets hashed onto [25:02.460 --> 25:06.140] a different side of the bond. [25:06.140 --> 25:10.540] So you need drivers that are respecting that hashing function. [25:10.540 --> 25:16.540] But yeah, if you get enough different flows, then it will actually hash across the bond, [25:16.540 --> 25:17.540] okay. [25:17.540 --> 25:19.420] It's all about the hashing modes. [25:19.420 --> 25:23.620] Not all switches support all hashing modes, which is the gotcha in that. [25:23.620 --> 25:27.620] Yeah, the other question I have is, I don't understand the connection between macvlan [25:27.620 --> 25:28.620] and RDMA. [25:28.620 --> 25:29.620] Sorry, what's that? [25:29.620 --> 25:32.300] The connection between macvlan and RDMA. [25:32.300 --> 25:34.300] The connection between the macvlan and RDMA? [25:34.300 --> 25:39.380] Yeah, why do you need macvlan to do the RDMA into your VMs? [25:39.380 --> 25:41.900] So you could just do host networking. [25:41.900 --> 25:45.980] If you did host networking on the pod, you would just have access to all of those host [25:45.980 --> 25:46.980] interfaces.
[25:46.980 --> 25:51.180] But if you want to have multiple different RDMA flows with different MAC and IP address [25:51.180 --> 25:56.580] pairs, then macvlan allows you to have those multiple pods, each with their own identity [25:56.580 --> 26:00.540] on the VLAN that's doing RDMA. [26:00.540 --> 26:05.100] Anyway, email me the next questions, I think. [26:05.100 --> 26:09.660] So I should let the next person set up. [26:09.660 --> 26:11.060] Any other questions for John? [26:11.060 --> 26:12.060] Can it? [26:12.060 --> 26:13.060] Oh. [26:13.060 --> 26:14.060] Yeah, last one. [26:14.060 --> 26:20.060] Actually, I had two, but okay, I'll rework it. [26:20.060 --> 26:21.060] Okay. [26:21.060 --> 26:24.060] So I saw that you were also creating Slurm clusters. [26:24.060 --> 26:25.060] Yes. [26:25.060 --> 26:31.780] So how do Kubernetes and Slurm play together for the network topology and placement of [26:31.780 --> 26:32.780] your workloads? [26:32.780 --> 26:35.780] Well, I have lots of ideas for that after your talk. [26:35.780 --> 26:37.220] At the moment, not really. [26:37.220 --> 26:40.460] So the pods just get placed wherever, and then... [26:40.460 --> 26:42.780] Yeah, at the moment they're totally isolated environments. [26:42.780 --> 26:46.780] So you stamp out a Slurm cluster and it's your own to do what you need with. [26:46.780 --> 26:53.780] And then, super briefly, the pink line, that was legacy RDMA, SR-IOV. [26:54.780 --> 26:55.780] Yes. [26:55.780 --> 26:59.500] Is that bare metal, or is that also virtualized? [26:59.500 --> 27:01.000] That was... [27:01.000 --> 27:04.000] Is it running on... [27:04.000 --> 27:05.000] Because that was... [27:05.000 --> 27:08.000] We can catch up later. [27:08.000 --> 27:09.000] Okay. [27:09.000 --> 27:15.020] So for that specific scenario, I definitely recommend watching that KubeCon talk, the "Five [27:15.020 --> 27:16.780] Ways with a CNI" one. [27:16.780 --> 27:22.220] I think that particular setup was actually bare metal with a virtual function. [27:22.220 --> 27:26.380] So it was actually Kubernetes on bare metal with the virtual function passed into the [27:26.380 --> 27:27.380] container. [27:27.380 --> 27:28.380] Right. [27:28.380 --> 27:35.380] I believe we got similar results without doing that legacy path into the VM as well. [27:36.180 --> 27:41.580] The extra cost, I believe, is on the VF LAG piece, because there's an extra bit of routing [27:41.580 --> 27:48.180] inside the silicon, I believe, but I'm not certain on that, so I'd have to check. [27:48.180 --> 27:49.180] Thank you. [27:49.180 --> 27:50.180] Pleasure. [27:50.980 --> 27:51.980] Thank you very much, John. [27:51.980 --> 27:52.980] Thank you.