[00:00.000 --> 00:08.000] All right, we'll get started. [00:08.000 --> 00:12.920] We have another talk on MPI, but I think a very different one: running MPI applications [00:12.920 --> 00:14.520] on the Toro unikernel. [00:14.520 --> 00:15.520] Exactly. [00:15.520 --> 00:16.520] Yeah. [00:16.520 --> 00:17.520] So, hello, everyone. [00:17.520 --> 00:18.520] I'm Matias. [00:18.520 --> 00:22.840] Here, I'm going to talk about running MPI applications on the Toro unikernel. [00:22.840 --> 00:28.320] Generally speaking, a unikernel is a way to deploy a user application in a way that is [00:28.320 --> 00:34.280] closer to the hardware, by trying to reduce the operating system interference. [00:34.280 --> 00:38.960] So, overall, it should perform better than just deploying a user application by [00:38.960 --> 00:41.960] using a general-purpose operating system.

[00:41.960 --> 00:48.640] First, I would like to introduce myself. I am passionate about operating system development [00:48.640 --> 00:50.680] and virtualization technologies. [00:50.680 --> 00:56.040] I have worked at Citrix and at Huawei, and I'm currently at Vates, and here I have my [00:56.040 --> 01:03.160] email and my GitHub profile, if you want to know more about what I'm doing.

[01:03.160 --> 01:07.920] So, I'm going to start by presenting what exactly a unikernel is, and then I'm going [01:07.920 --> 01:15.360] to go into the details of what makes Toro special, and then I will show the current implementation [01:15.360 --> 01:22.640] of the MPI standard on top of Toro, and I will finish with a benchmark that I'm trying [01:22.640 --> 01:27.400] to do to see if the current implementation is working as expected, or if there are things [01:27.400 --> 01:30.680] that could be improved.

[01:30.680 --> 01:34.600] So, maybe you are already familiar with this picture. [01:34.600 --> 01:39.120] This is more or less how a user application is deployed, either using a virtual machine [01:39.120 --> 01:40.120] or bare metal. [01:40.120 --> 01:44.080] So, what we have is the operating system, the user application, and the two different [01:44.080 --> 01:49.120] modes, ring 3 and ring 0, which are the different modes in the x86 processor. [01:49.120 --> 01:55.920] So, in general, what we have is that when a user application wants to open a [01:55.920 --> 01:59.920] file, send a packet, or whatever, it's going to trigger a syscall, and then there's going to [01:59.920 --> 02:07.800] be a switch in the mode the processor is running in, from user space to kernel space, [02:07.800 --> 02:13.600] so the request is going to be processed in kernel space, and come back, right?

[02:13.600 --> 02:18.360] In general, when we look at what we have inside the kernel, well, we have different components, [02:18.360 --> 02:22.480] right? For example, we have the scheduler, the file system, different drivers, and so [02:22.480 --> 02:23.480] on. [02:23.480 --> 02:28.440] So, in particular, the scheduler is going to choose the [02:28.440 --> 02:31.160] next process that's going to be executed. [02:31.160 --> 02:35.480] One of these processes, or several of them, is going to be your MPI application, for example. [02:35.480 --> 02:42.920] So, if you deploy your MPI application by using a general-purpose operating system, [02:42.920 --> 02:46.400] your application is going to compete with other processes in the system for sure.
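As an illustration of the syscall path just described: in the minimal C program below, every library call traps from ring 3 into ring 0 and back, which is exactly the per-operation cost a unikernel tries to avoid. Running the program under strace on Linux makes each transition visible.

```c
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* open() triggers a syscall: the CPU switches from user space
     * (ring 3) to kernel space (ring 0), the kernel services the
     * request, and control returns to user space. */
    int fd = open("/tmp/example.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, "hello\n", 6);  /* another user->kernel->user round trip */
    close(fd);                /* and another */
    return 0;
}
```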
[02:46.400 --> 02:50.760] And also, what you have in the scheduler is some policy, which is going to decide which [02:50.760 --> 02:55.080] is the next process to be scheduled. [02:55.080 --> 03:00.840] Also, we have components like the file system, and since we have a general-purpose operating [03:00.840 --> 03:05.440] system, we're going to have several drivers for different file systems, different [03:05.440 --> 03:07.600] device drivers, and so on.

[03:07.600 --> 03:14.600] So, what some people observed was that there was too much generality in using a general- [03:14.600 --> 03:21.120] purpose operating system for a single-purpose application, like an MPI application can be. [03:21.120 --> 03:28.240] So, some people came up with a new architecture; they proposed what they call unikernels. [03:28.240 --> 03:33.400] You have some projects there, like OSv, MirageOS, Unikraft, or NanoVMs. [03:33.400 --> 03:38.520] What they do is just take the user application and compile it within the kernel itself. [03:38.520 --> 03:44.200] So, at the end, what you have is a single binary that is going to be deployed, either [03:44.200 --> 03:48.480] by using a virtual machine or bare metal, right? [03:48.480 --> 03:54.080] So, instead of, for example, having the syscalls that we have in the case of a general- [03:54.080 --> 03:58.880] purpose operating system with different modes of execution, in the case of a unikernel [03:58.880 --> 04:08.480] we have simply function calls, which are cheaper than syscalls, for example.

[04:08.480 --> 04:12.280] In general, the projects that I presented before all conform to the [04:12.280 --> 04:17.680] POSIX standard, so it means that if you have any application written in C that conforms to [04:17.680 --> 04:24.400] POSIX, you can theoretically compile it with the unikernel without any modification of [04:24.400 --> 04:25.880] the user application. [04:25.880 --> 04:30.080] In reality, this does not happen, and most of the time the POSIX API that the unikernel [04:30.080 --> 04:35.400] implements is not complete, so you have to do some work; you cannot just take [04:35.400 --> 04:39.360] your application and compile it and generate something; it doesn't work out of the box [04:39.360 --> 04:42.440] in most of the cases, right?

[04:42.440 --> 04:48.360] So, in this context, what is Toro? Toro is also a unikernel; it's an application-oriented [04:48.360 --> 04:55.600] unikernel, and the idea of Toro is to offer an API which is dedicated, I mean, to efficiently [04:55.600 --> 04:59.040] deploying parallel applications. [04:59.040 --> 05:04.360] In the case of Toro, it's not POSIX-compliant; it means that even if the names of [05:04.360 --> 05:09.400] the functions, like open and close and so on, are more or less the same names, the [05:09.400 --> 05:14.240] semantics of these functions are slightly different, so I would not say that it's POSIX-compliant [05:14.240 --> 05:18.440] in that sense, and I will explain that later.

[05:18.440 --> 05:25.440] So, let's say that the three building blocks of the Toro unikernel are the memory per [05:25.440 --> 05:31.080] core, the cooperative scheduler, and core-to-core communication based on VirtIO. [05:31.080 --> 05:35.560] Here I'm talking about the architecture of the unikernel; I'm not yet talking about [05:35.560 --> 05:39.760] how we're going to build an application to compile for Toro, right? [05:39.760 --> 05:42.760] And I'm going to explain these three points.
[05:42.760 --> 05:48.360] So, first, what happens in the Toro unikernel is that we have memory dedicated per core, [05:48.360 --> 05:54.160] so at the beginning what we do is just allocate memory, I mean, split the whole memory into [05:54.160 --> 06:01.560] regions, and we assign these regions per core, and for the moment the size of these regions [06:01.560 --> 06:06.480] is just proportional to the number of cores that we have. [06:06.480 --> 06:11.880] That means that, for example, the memory allocator is quite simple; it doesn't require any communication [06:11.880 --> 06:18.480] because we have chunks of data, I mean, yeah, we have one allocator per [06:18.480 --> 06:22.160] core, which means that we don't require any synchronization in the kernel to allocate [06:22.160 --> 06:23.160] for one core. [06:23.160 --> 06:28.440] We call it per-CPU data, let's say, yeah. [06:28.440 --> 06:33.640] So, for example, if we have a thread on core one and we want to allocate memory, we're [06:33.640 --> 06:37.440] always going to get it from the same region, and that also happens if you're on core [06:37.440 --> 06:40.760] two: we're going to allocate from region two. [06:40.760 --> 06:45.160] And the idea is that by doing this, we can then leverage technologies like HyperTransport [06:45.160 --> 06:49.920] or Intel QuickPath Interconnect, in which we can say, well, this core is going to access [06:49.920 --> 06:54.200] this region of memory, and if it accesses another region, it's going to pay a penalty [06:54.200 --> 06:56.160] to do it, right?

[06:56.160 --> 07:02.720] So, talking about the scheduler, what happens in Toro is that we only have threads; we [07:02.720 --> 07:09.720] don't have processes, which means that all threads share the same view of the memory, and we have mainly [07:09.720 --> 07:14.920] one API to create a thread, called BeginThread, and it has a parameter that [07:14.920 --> 07:21.840] says on which core each thread is going to run. [07:21.840 --> 07:26.360] The scheduler is cooperative, which means that it is the thread that's going to call the [07:26.360 --> 07:32.360] scheduler to then choose another thread, and this is done by relying on the API call SysThreadSwitch; [07:32.360 --> 07:37.600] most of the time, this is invoked because we are going to be [07:37.600 --> 07:42.080] idle for a while, so we just call the scheduler, or, for example, we're going to do some [07:42.080 --> 07:45.480] I/O. [07:45.480 --> 07:51.560] So the scheduler is also very simple; we have, again, per-CPU data, so we have one [07:51.560 --> 07:58.200] queue per core, and the scheduler is simply going to choose the next thread that is ready [07:58.200 --> 08:03.520] from that queue, and this means that we also don't require any synchronization [08:03.520 --> 08:08.600] at the level of the kernel to schedule a thread, so it's like each core runs independently [08:08.600 --> 08:15.280] from one another.

[08:15.280 --> 08:20.800] Finally, I am going to talk a bit about how we communicate between cores, and basically what we have [08:20.800 --> 08:29.280] is one dedicated reception queue per core for any other core in the system, so we have one- [08:29.280 --> 08:32.840] to-one communication. [08:32.840 --> 08:38.680] This basically relies on two primitives, which are send and receive-from, and it's just by [08:38.680 --> 08:47.080] using the destination core and the core from which we want to get a packet, for example.
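The talk does not show the exact signatures, and Toro itself is written in FreePascal, so the following is only a hypothetical C-style sketch of the primitives just described: BeginThread pinning a thread to a core, SysThreadSwitch yielding to the cooperative per-core scheduler, and the two core-to-core primitives (the names send_to and recv_from are assumptions for illustration).

```c
/* Hypothetical, illustrative declarations; not Toro's actual API. */
typedef void (*thread_fn)(void *arg);

/* Create a thread pinned to `core`; each core has its own run queue,
 * so scheduling needs no kernel-side synchronization. */
int BeginThread(thread_fn entry, void *arg, int core);

/* Cooperative yield: the running thread calls the scheduler itself,
 * typically before going idle or while waiting for I/O. */
void SysThreadSwitch(void);

/* Core-to-core communication: one dedicated reception queue per pair
 * of cores, addressed by destination and source core. */
int send_to(int dst_core, const void *buf, unsigned len);
int recv_from(int src_core, void *buf, unsigned len);

/* Example: a thread on core 0 polls for a packet from core 1,
 * yielding between attempts so other threads on core 0 can run. */
static void receiver(void *arg)
{
    char msg[64];
    (void)arg;
    while (recv_from(1, msg, sizeof msg) != 0)
        SysThreadSwitch();
}
```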
[08:47.080 --> 08:53.920] These two primitives are the ingredients to then build more complicated APIs like MPI_Gather, [08:53.920 --> 08:59.160] MPI_Bcast, and MPI_Scatter, so these are the building blocks for those APIs, for [08:59.160 --> 09:01.960] example. [09:01.960 --> 09:06.800] So to implement this core-to-core communication, I was using VirtIO, so I was just following [09:06.800 --> 09:13.320] the specification. I will talk a little bit about this; I don't want to go too much into [09:13.320 --> 09:19.080] detail, just enough to understand roughly how communication between cores is done.

[09:19.080 --> 09:26.080] As I said before, we have one reception queue in each core for any other core in the system, [09:26.080 --> 09:33.280] which means that, for example, if core one wants to get packets from core two, we have a reception [09:33.280 --> 09:39.520] queue, and also if core one wants to send a packet to core two, it's going to have a transmission [09:39.520 --> 09:44.280] queue, and the number of queues is for sure going to be different if you have three cores, [09:44.280 --> 09:50.920] for example, because the virtqueues are dedicated.

[09:50.920 --> 09:56.120] So basically, a virtqueue is made of three ring buffers. [09:56.120 --> 10:03.320] The first ring buffer is the buffer which only contains descriptors to chunks of memory. [10:03.320 --> 10:08.320] The second buffer is the available ring, and the third buffer is the used ring. [10:08.320 --> 10:14.960] Basically, what happens is that the available ring holds the buffers that core one is exposing [10:14.960 --> 10:16.640] to core two. [10:16.640 --> 10:22.320] So if core two wants to send a packet to core one, it's going to get a buffer from the [10:22.320 --> 10:26.560] available ring, put the data in, and then put it in the used ring. [10:26.560 --> 10:33.480] This is basically how VirtIO works; it's just that, if you are familiar with VirtIO, in [10:33.480 --> 10:39.560] this case, for example, the consumer of the available ring is core two, but if, for example, [10:39.560 --> 10:45.040] you are in a hypervisor and you're implementing some VirtIO device, the consumer is not [10:45.040 --> 10:49.480] going to be core two, but it's going to be the device model, QEMU, for example, [10:49.480 --> 10:50.840] if you are familiar with that. [10:50.840 --> 10:54.880] But it's the same scheme. [10:54.880 --> 11:00.960] This means that, for example, since we have one producer and one consumer, we can access [11:00.960 --> 11:08.280] the virtqueue without any synchronization, I mean, at least if we have only one consumer. [11:08.280 --> 11:13.840] So you don't require any lock, for example, to access the virtqueue.

[11:13.840 --> 11:18.880] So yeah, I've already talked too much; I don't know how much time I have left, but I wanted [11:18.880 --> 11:25.320] to show some examples of the implementation; maybe it's more fun than all the slides I [11:25.320 --> 11:26.320] showed. [11:26.320 --> 11:32.960] So what happens, how do we deploy an application by using Toro? [11:32.960 --> 11:38.360] We have the MPI application, which is a C application for the moment, and you compile it with a [11:38.360 --> 11:46.080] unit that's just going to link the application with some functions that are the implementation [11:46.080 --> 11:53.040] of the MPI API, for example MPI_Bcast, MPI_Gather, and so on; it's implemented at this [11:53.040 --> 11:55.440] level, in the MPI interface.
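The lock-free virtqueue access described a moment ago can be sketched as a single-producer/single-consumer ring. The code below is illustrative only: it collapses the three VirtIO rings (descriptor, available, used) into fixed-size slots, so the layout and names are assumptions rather than Toro's implementation. Because each index has exactly one writer, neither side ever takes a lock.

```c
#include <stdatomic.h>
#include <string.h>

#define RING_SIZE 256   /* power of two, so index arithmetic wraps */
#define SLOT_BYTES 64

struct ring {
    _Atomic unsigned head;  /* written only by the producer core */
    _Atomic unsigned tail;  /* written only by the consumer core */
    unsigned char slots[RING_SIZE][SLOT_BYTES];
};

/* Producer side: analogous to taking a buffer from the available
 * ring, filling it, and placing it on the used ring. -1 when full. */
int ring_send(struct ring *r, const void *data, unsigned len)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE || len > SLOT_BYTES)
        return -1;
    memcpy(r->slots[head % RING_SIZE], data, len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer side: drains the ring and recycles the slot. -1 when empty. */
int ring_recv(struct ring *r, void *data, unsigned len)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head || len > SLOT_BYTES)
        return -1;
    memcpy(data, r->slots[tail % RING_SIZE], len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}
```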
[11:55.440 --> 11:59.680] And this MPI interface unit is going to use some APIs from the unikernel. [11:59.680 --> 12:04.280] So at the end, what you're going to get is an ELF binary that can be used to deploy [12:04.280 --> 12:09.200] your application, either as a virtual machine or on bare metal. [12:09.200 --> 12:15.760] So you don't have any operating system in between there; you have only your application, the [12:15.760 --> 12:20.880] threads and so on, but you have nothing else. [12:20.880 --> 12:25.320] So if you want to see how it is deployed: if you look at the MPI application, what [12:25.320 --> 12:30.920] is going to happen is that we take the main function and then instantiate it once [12:30.920 --> 12:41.600] for every core in the system, as a thread.

[12:41.600 --> 12:46.640] So, to benchmark the current implementation: I'm not very familiar with the MPI world, I [12:46.640 --> 12:51.160] was just coming from another domain, so I am not really familiar with how I had to benchmark [12:51.160 --> 12:58.240] such an implementation, and so I chose the OSU micro-benchmarks, maybe you know them, maybe [12:58.240 --> 13:06.600] not, and I just picked one of them, for example MPI_Barrier, and I tried to implement it; [13:06.600 --> 13:10.920] the benchmark itself is quite simple, so I decided to implement [13:10.920 --> 13:11.920] it. [13:11.920 --> 13:20.080] I could not take the benchmark as it is; I had to do some rework to make it work, and [13:20.080 --> 13:25.720] then my idea was to see how this behaves when I deploy it as a single VM with [13:25.720 --> 13:29.040] many cores.

[13:29.040 --> 13:36.200] The hardware that I used is this one; since I'm not familiar with the high-performance [13:36.200 --> 13:42.120] computing world, I'm not really sure if this is hardware that you often use; it's quite [13:42.120 --> 13:54.520] a new Intel machine; you can get it at Equinix; the price is four euros per hour.

[13:54.520 --> 13:59.400] So I launched the test and I tried to measure things, so I was just measuring the latency [13:59.400 --> 14:07.280] of this, and I was taking into account the max latency over four, eight, [14:07.280 --> 14:11.120] sixteen, or thirty-two cores. [14:11.120 --> 14:19.760] I am getting values in the order of microseconds, and then I found this paper, which was also [14:19.760 --> 14:29.000] using this benchmark to measure their platform, and, well, in this paper, they [14:29.000 --> 14:39.160] were reporting around 20 and 13, sorry, this is microseconds, [14:39.160 --> 14:46.680] not nanoseconds, on their platform.

[14:46.680 --> 14:56.200] In any case, I would be very cautious about this graph, because I was getting a lot of [14:56.200 --> 15:02.120] variation in the numbers; most of the time, for example, I was testing on a machine with [15:02.120 --> 15:09.200] thirty-two cores, and the VM also had thirty-two vCPUs, so you should not test on [15:09.200 --> 15:13.080] that sort of machine, because one of the threads is going to compete with the others, [15:13.080 --> 15:18.920] with the main one of the host, so you should always test with fewer vCPUs than [15:18.920 --> 15:19.920] physical cores.
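For reference, the barrier benchmark just mentioned boils down to timing a loop of MPI_Barrier calls. Below is a minimal sketch of that measurement; the warm-up and iteration counts are arbitrary choices, not the OSU defaults, and the final reduction reports the maximum average latency across ranks, which matches the max-latency figure discussed above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int warmup = 100, iters = 1000;  /* illustrative counts */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < warmup; i++)       /* untimed warm-up rounds */
        MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double usec = (MPI_Wtime() - t0) * 1e6 / iters;

    /* report the worst (max) average latency across all ranks */
    double max_usec = 0.0;
    MPI_Reduce(&usec, &max_usec, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("max avg barrier latency: %.2f us\n", max_usec);

    MPI_Finalize();
    return 0;
}
```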
[15:19.920 --> 15:27.040] And, yeah, the idea is to continue doing this, I mean, improving the way I am measuring [15:27.040 --> 15:32.440] this, and also maybe trying different hardware, and at the same time, I found a lot of bugs [15:32.440 --> 15:36.320] in the unikernel by doing this; for example, at the beginning, I only supported more or less [15:36.320 --> 15:42.160] four cores, so I went from four to thirty-two (well, it was a number in a constant, but anyway), [15:42.160 --> 15:48.600] and I found many bugs when I was doing this. So this is all just a proof of concept [15:48.600 --> 15:52.800] and a work in progress, so don't take it too seriously, is what I'm trying to say; I don't [15:52.800 --> 16:01.840] want to jump to any conclusion from this, and, yeah, it was fun to do, anyway.

[16:01.840 --> 16:17.720] So that's all; I don't know if you have any questions.

[16:17.720 --> 16:19.560] So you said this runs on bare metal. [16:19.560 --> 16:20.560] Sorry? [16:20.560 --> 16:22.760] The unikernel runs on bare metal. [16:22.760 --> 16:23.760] Yeah, there are some. [16:23.760 --> 16:24.760] How do you even install it? [16:24.760 --> 16:27.320] I mean, operating systems are kind of complicated, right? [16:27.320 --> 16:28.320] Sorry? [16:28.320 --> 16:32.000] How do you even install it on bare metal? [16:32.000 --> 16:33.500] Can you say that again? [16:33.500 --> 16:35.960] How do you install it on bare metal? [16:35.960 --> 16:37.960] How do you install it? [16:37.960 --> 16:41.560] Yeah, like if I had this, how would I install it on bare metal? [16:41.560 --> 16:43.840] Is there an installer or...? [16:43.840 --> 16:44.840] An installer, you mean? [16:44.840 --> 16:45.840] Yeah. [16:45.840 --> 16:50.560] No, you can just use some device to boot from, for example. [16:50.560 --> 16:51.560] So it's bootable? [16:51.560 --> 16:52.560] Yeah, that's it. [16:52.560 --> 16:53.560] Yeah. [16:53.560 --> 16:54.560] Okay. [16:54.560 --> 16:55.560] Yeah. [16:55.560 --> 16:56.560] Well, yeah. [16:56.560 --> 16:59.640] There are many ways to do that; for example, you don't have to install it. [16:59.640 --> 17:09.440] You can boot from a device that is removable, for example; you don't need to install it.

[17:09.440 --> 17:10.440] Any questions? [17:10.440 --> 17:11.440] Thanks. [17:11.440 --> 17:12.440] Thank you. [17:31.060 --> 17:43.020] Thank you very much.