[00:00.000 --> 00:09.560] Hi, everyone. So it's my pleasure to introduce Babis and Anastasios. They're going to give [00:09.560 --> 00:15.480] you the talk on using vAccel for hardware acceleration in your unikernels. Babis, please.
[00:15.480 --> 00:20.200] So hello, everyone. I'm Babis. My actual name is Charalampos Mainas, but you can just call me [00:20.200 --> 00:26.920] Babis. So we're going to give a talk about hardware acceleration and our effort to have [00:26.920 --> 00:33.920] some support for it in unikernels, and we do that with vAccel. So, yeah. [00:33.920 --> 00:36.920] Yeah, the mic, oh, sorry. I forgot about that. [00:36.920 --> 00:37.920] Oh, okay. [00:37.920 --> 00:44.920] Oh, let's forget it. Yeah, put that over here or over there, and maybe you can just keep it here.
[00:44.920 --> 00:52.040] Okay. So, yeah, we already heard from Simon, so we don't have to repeat what unikernels [00:52.040 --> 00:58.600] are. There are a lot of projects, and we know that they are promising. It's a promising technology. [00:58.600 --> 01:06.280] We can have very fast boot times, low memory footprint, and some increased security. We [01:06.280 --> 01:11.560] also know some of the use cases for unikernels, which are usually traditional applications [01:11.560 --> 01:16.880] that you might have heard of, like web servers and stuff like that. But they have also been [01:16.880 --> 01:22.120] used for NFV, and we think that they are also a good fit for serverless and in general micro [01:22.120 --> 01:28.480] services deployments, either in the cloud or at the edge. And we also think that they can [01:28.480 --> 01:33.960] be a good fit, especially in this case, for ML and AI applications. And that [01:33.960 --> 01:41.720] sounds a bit weird because, as we know, ML and AI workloads are quite huge and heavy. [01:41.720 --> 01:46.920] So, maybe you have heard about PyTorch, maybe you have heard about TensorFlow. We're not [01:46.920 --> 01:52.320] going to touch them, don't worry. But what we want to say here is that they are very, [01:52.320 --> 01:57.960] very heavy frameworks, very difficult to add support for. And secondly, we know that [01:57.960 --> 02:07.000] these kinds of applications are usually compute-intensive applications that can take a lot of resources. [02:07.000 --> 02:11.280] And for that exact reason, we see that there is also a shift in the hardware that exists [02:11.280 --> 02:16.760] in the data centers, and not only in the data centers, but also at the edge. We see devices [02:16.760 --> 02:22.200] that are equipped with a lot of new processing units. Of course, we have the traditional [02:22.200 --> 02:32.800] FPGAs and CPUs, but we also have specialized processing units like TPUs and also some ASICs. [02:32.800 --> 02:39.800] And first of all, as we know, ML and AI workloads cannot be executed in unikernels, that's for [02:39.800 --> 02:44.000] sure, because there is no support for these frameworks. And secondly, there is no support [02:44.000 --> 02:52.360] for hardware acceleration, so there is not really any benefit if we run them on a CPU. [02:52.360 --> 03:01.960] So, I will give a small overview: I'm going to go through the acceleration stack and how we [03:01.960 --> 03:08.360] can virtualize it with the current approaches. So, in general, what we have, it's pretty [03:08.360 --> 03:12.640] simple.
Usually, you have an application which is written in an acceleration framework, [03:12.640 --> 03:18.480] it can be OpenCL, it can be CUDA, it can be TensorFlow, PyTorch, all of these frameworks. [03:18.480 --> 03:26.280] Usually underneath that, you have the runtime for the GPU or maybe a runtime for FPGAs. [03:26.280 --> 03:32.240] And then you also have, of course, the device driver which resides inside the kernel. So, [03:32.240 --> 03:38.840] this is what we have to virtualize. And as we know, unikernels are virtual machines, [03:38.840 --> 03:42.280] so the same techniques that we have for virtual machines, we can also use [03:42.280 --> 03:48.520] in unikernels. Some of these techniques are hardware partitioning, para-virtualization, [03:48.520 --> 03:57.760] and remote API. So, in the case of hardware partitioning, the hardware accelerator has [03:57.760 --> 04:05.760] the capability to partition itself, and we assign this small part of the accelerator [04:05.760 --> 04:12.480] to the VM, and the VM can access the hardware accelerator directly. This has very [04:12.480 --> 04:16.800] good performance. On the other hand, we need to have the entire acceleration stack inside [04:16.800 --> 04:25.760] the VM, from the device driver to the application, to the acceleration framework. There is also, [04:25.760 --> 04:30.840] I forgot to mention here, the fact that this is something that has to be supported [04:30.840 --> 04:38.240] by the device, and a device driver also needs to be in the VM. And in the case of para-virtualization, [04:38.240 --> 04:44.320] things are getting a bit better because we can have a generic, let's say, device. [04:44.320 --> 04:54.120] And then the hypervisor simply manages the accelerator, and we can have the requests [04:54.120 --> 04:58.840] to the accelerator managed by the hypervisor, so we don't need to have all these kinds of [04:58.840 --> 05:04.520] different drivers for every accelerator inside the VM. On the other hand, we still need to [05:04.520 --> 05:10.520] have the vendor runtime and the application and acceleration framework. In the case of [05:10.520 --> 05:18.720] remote API, we have an even lighter approach. Everything is managed by a server. This [05:18.720 --> 05:24.320] server might even be local, on the same machine, or it can be a remote server. And what happens [05:24.320 --> 05:32.200] here is that the acceleration framework intercepts the calls from the application and forwards [05:32.200 --> 05:39.960] them to the acceleration framework that resides on the server. This has some performance overhead, [05:39.960 --> 05:46.760] of course, because of the transport that happens. And it's also framework specific. So it has [05:46.760 --> 05:54.200] to be supported; there is remote CUDA, for example, that supports it. So great, but [05:54.200 --> 05:59.200] what is the best fit for unikernels? In the case of hardware partitioning, this means that [05:59.200 --> 06:04.600] we have to port the entire software acceleration stack and every device driver to the unikernel, [06:04.600 --> 06:09.560] which is not an easy task. Again, in para-virtualization, things are [06:09.560 --> 06:15.000] a bit better. We have to port only maybe one driver, but still we need to port all of the [06:15.000 --> 06:19.400] acceleration stack.
In the case of a remote API, this is something that sounds much more [06:19.400 --> 06:26.960] feasible because we can port only, let's say, remote CUDA, only one framework. But how easy [06:26.960 --> 06:33.040] is that? It's not easy because, as I said before, these kinds of frameworks are huge. [06:33.040 --> 06:44.160] They have a very, very big code base, they have dynamic linking, which is at odds with [06:44.160 --> 06:51.800] unikernels, and a lot of dependencies. So it's not going to be easy to port [06:51.800 --> 07:02.200] any existing acceleration framework to a unikernel right now. So for that reason, we think that vAccel is suitable [07:02.200 --> 07:10.120] for unikernels. So I will hand over to Tasos to present a bit of how vAccel works.
[07:10.120 --> 07:27.640] Thank you. Thank you. So hi from my side, too. I'm going to talk a bit about the framework [07:27.640 --> 07:39.120] that we're building. So we started working on vAccel to handle hardware acceleration [07:39.120 --> 07:48.280] virtualization in VMs, so it's not tailored to unikernels. We have been playing with semantically [07:48.280 --> 07:56.840] exposing hardware acceleration functionality from hardware acceleration frameworks to VMs. [07:56.840 --> 08:07.520] And the software stack is shown in the figure. We use a hardware-agnostic API, so we expose [08:07.520 --> 08:17.680] the whole function call of the hardware-accelerated operation. And we focus on portability [08:17.680 --> 08:25.680] and on interoperability, meaning that the same binary code originating from the application [08:25.680 --> 08:32.360] can be executed on many types of architectures, and it is decoupled from the hardware-specific [08:32.360 --> 08:40.840] implementation. A closer look at the software stack: we have an application. This application [08:40.840 --> 08:51.960] consumes the vAccel API, which supports specific operations. These operations are [08:51.960 --> 09:01.560] mapped through a mapping layer, vAccelRT, to the relevant plugins, which are shown [09:01.560 --> 09:10.560] in greenish, and they are actually the glue code between the API calls and the hardware-specific [09:10.560 --> 09:19.480] implementation, which in this figure resides in the external libraries layer. And [09:19.480 --> 09:32.280] then it's the hardware that executes whatever there is in the external libraries. So digging [09:32.280 --> 09:44.680] a bit more into how vAccel works: the core library, the core component of vAccel, exposes [09:44.680 --> 09:52.600] the API to the application and maps the API calls to the relevant hardware plugins, which, [09:52.600 --> 10:03.200] by the way, are loaded at runtime. These plugins are actually glue code between the API calls [10:03.200 --> 10:10.680] and the hardware-specific implementation. So for example, we have an API call for doing [10:10.680 --> 10:16.040] image classification, image inference in general. The only thing that the application needs [10:16.040 --> 10:25.000] to submit to vAccel is: I want to do image classification, this is the image, this is the model, and [10:25.000 --> 10:30.840] the parameters, and so on. And this gets mapped to the relevant plugin implementation.
[10:30.840 --> 10:37.800] For instance, in this figure, we can use the jetson-inference image classification implementation, [10:37.800 --> 10:44.120] which translates these arguments and this operation to the actual jetson-inference [10:44.120 --> 10:53.080] framework provided by NVIDIA that does the image classification operation. Apart from [10:53.080 --> 11:03.440] the hardware-specific plugins, we also have the transport layer plugins. So imagine this [11:03.440 --> 11:10.920] same operation, the image inference, could be executed in a VM using a virtual plugin. [11:10.920 --> 11:19.720] So this information, the operation, the arguments, the models, everything, will be transferred [11:19.720 --> 11:28.800] to the host machine, which will use a hardware plugin. So apart from the glue code for the [11:28.800 --> 11:41.440] hardware-specific implementations, we also have the VM plugins. Also, some of the [11:41.440 --> 11:50.440] plugins and the API operations support a subset of acceleration frameworks, such as TensorFlow [11:50.440 --> 12:02.200] or PyTorch. And as for what I mentioned earlier about the virtual plugins, essentially [12:02.200 --> 12:09.200] what happens is that the request for the operation and the arguments is forwarded to another [12:09.200 --> 12:21.360] instance of the vAccel library, either at the hypervisor layer or over a socket interface. [12:21.360 --> 12:31.000] So we currently support two modes of operation. We have a virtio driver, and currently we support [12:31.000 --> 12:41.440] Firecracker and QEMU. So we load the driver in the VM. This driver transfers the arguments [12:41.440 --> 12:47.280] and the operation to the backend, the QEMU backend or the Firecracker backend, which [12:47.280 --> 12:54.880] in turn calls the vAccel library to do the actual operation. And the other option is using sockets. [12:54.880 --> 13:02.760] So we run a socket interface, a socket agent, on the host, we have the vsock plugin on the [13:02.760 --> 13:25.240] guest, and they communicate over simple sockets. I'm going to hand over to Babis for the unikernel [13:25.240 --> 13:26.240] stuff.
[13:26.240 --> 13:34.040] So how can vAccel be used in unikernels? Actually, it's quite easy compared to any other acceleration [13:34.040 --> 13:41.840] framework that exists. And the thing is that the only thing that we need to do is just [13:41.840 --> 13:47.720] to have that vAccelRT that you see over there. That's the only thing that we need to port. And this [13:47.720 --> 13:52.800] is a very, very thin layer of C code. It can be easily ported to any unikernel that [13:52.800 --> 14:04.520] exists. And, of course, we need some kind of transport plugin to forward the requests. [14:04.520 --> 14:09.680] So as Tasos already explained, the same application that we [14:09.680 --> 14:15.000] can run on the host, or in any container, or in any VM, can also be used in the unikernel, [14:15.000 --> 14:21.760] with no changes. It simply uses a specific API of vAccel, and then we simply forward [14:21.760 --> 14:26.000] the request to the host, and then we have another instance of vAccel, which is on the host [14:26.000 --> 14:32.240] and simply maps to the hardware acceleration framework that implements the specific [14:32.240 --> 14:34.600] function. [14:34.600 --> 14:40.960] So this, as I said, allows us to have the same application running either on the [14:40.960 --> 14:50.080] host or in the VM without any changes. So it's easy to debug, easy to execute.
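A minimal sketch of what such a client application could look like against the vAccel API described above: create a session, submit the image-classification operation, and release the resources. The header name, function names, signatures, and return codes below are assumptions for illustration only; the actual vAccelRT API may differ in detail, so check the vAccel examples and docs for the exact calls.

    /* Hypothetical sketch of a vAccel image-classification client.
     * NOTE: vaccel.h, vaccel_sess_init(), vaccel_image_classification(),
     * vaccel_sess_free() and VACCEL_OK are assumed names; see the real
     * vAccelRT headers for the exact API. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <vaccel.h>                 /* assumed vAccelRT header */

    int main(int argc, char **argv)
    {
            struct vaccel_session sess; /* assumed session type */
            char out_text[512];         /* classification tag, e.g. "hedgehog" */
            char out_imgname[512];

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <image>\n", argv[0]);
                    return 1;
            }

            /* Read the image into memory (in the Unikraft demo the file is
             * shared with the guest over 9pfs, so no network is needed). */
            FILE *f = fopen(argv[1], "rb");
            if (!f)
                    return 1;
            fseek(f, 0, SEEK_END);
            long img_len = ftell(f);
            fseek(f, 0, SEEK_SET);
            unsigned char *img = malloc(img_len);
            if (!img || fread(img, 1, img_len, f) != (size_t)img_len) {
                    fclose(f);
                    return 1;
            }
            fclose(f);

            /* 1. Create a session with the runtime; on the guest this ends up
             *    at the transport plugin, on the host at a hardware plugin. */
            if (vaccel_sess_init(&sess, 0) != VACCEL_OK)
                    return 1;

            /* 2. Submit the operation: "classify this image". The plugin that
             *    the host is configured with (jetson-inference, noop, ...)
             *    does the actual work. */
            if (vaccel_image_classification(&sess, img,
                            (unsigned char *)out_text,
                            (unsigned char *)out_imgname,
                            img_len, sizeof(out_text),
                            sizeof(out_imgname)) == VACCEL_OK)
                    printf("classification: %s\n", out_text);

            /* 3. Release the resources. */
            vaccel_sess_free(&sess);
            free(img);
            return 0;
    }

The same source builds for the host and for the unikernel; only the plugin configuration on the host side decides where the work actually executes.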
And [14:50.080 --> 14:56.440] we can also access different kinds of hardware, different kinds of frameworks that exist. And [14:56.440 --> 15:00.360] we don't need to change the application. We can simply change the configuration on [15:00.360 --> 15:02.520] the host. [15:02.520 --> 15:08.200] So yes, this is yet another acceleration framework, and maybe you think that it is not going [15:08.200 --> 15:15.120] to be easy to use. But let's take an example and see how we can extend vAccel, and see [15:15.120 --> 15:20.280] whether it is easy or not. So let's take a typical vector addition example in OpenCL, which can [15:20.280 --> 15:26.920] be executed on the CPU or on an FPGA. And the steps that usually happen are that we set up [15:26.920 --> 15:33.200] the bitstream in the FPGA and the FPGA starts the reconfiguration. Of course, we transfer [15:33.200 --> 15:40.000] the data to the FPGA. Then we invoke the kernel as soon as it's ready, and we also get the [15:40.000 --> 15:45.600] results back to the host. So this is what the application is already doing. So if you [15:45.600 --> 15:49.520] already have this application running on your machine, the only thing that you have to do [15:49.520 --> 15:56.280] is to somehow 'libify' the application, that is, instead of a standalone program, just expose an API [15:56.280 --> 15:58.760] that does that. [15:58.760 --> 16:05.080] And the next thing is that you can integrate the library into vAccel as a plugin. We [16:05.080 --> 16:10.960] have a very simple API that you can use, and therefore the application will be seen [16:10.960 --> 16:18.400] as a plugin for vAccel. Later, you can also update vAccel, just adding one more API call to [16:18.400 --> 16:25.480] vAccelRT, so the application can use it directly, with the correct parameters, of course. [16:25.480 --> 16:40.120] So I will give you a short demo of how this works, using Unikraft specifically. We will [16:40.120 --> 16:44.240] have image classification at first, and then we can [16:44.240 --> 16:51.120] see how a BLAS CUDA operation can be executed on the CPU and the GPU without [16:51.120 --> 17:02.040] any changes. And maybe some FPGA, if we have time. Okay, this is not good. This is better. [17:02.040 --> 17:17.120] So we are in a typical working environment for Unikraft. We have created our application. [17:17.120 --> 17:23.920] We have a newlib which we are not actually going to use. And we also have Unikraft. [17:23.920 --> 17:30.720] So let's go here. So this is a repo that we have created. I will show it to you later. [17:30.720 --> 17:43.040] So this is what I want to show you. Here you can see that we only enabled [17:43.040 --> 17:48.160] 9pfs, and we use it because we want to transfer the data inside the unikernel. So [17:48.160 --> 17:53.680] we are not going to use any network. We are just going to share a directory with the VM. [17:53.680 --> 17:58.520] And the only thing that you need to do is to select vAccelRT, and that's all. As you see, [17:58.520 --> 18:06.400] we don't have any libc because we don't need it for this specific example. So these are [18:06.400 --> 18:12.680] all the applications that currently run on Unikraft. You can try them out yourself. [18:12.680 --> 18:29.560] So we're going to use image classification. It will take [18:29.560 --> 18:35.760] some time to build.
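Going back to the vector-addition example above, here is a rough sketch of the 'libify and register as a plugin' idea. The structure names, the registration scheme, and the operation ID below are purely hypothetical and only illustrate the shape of the glue code; they are not the actual vAccel plugin API, which is documented in the vAccel repos.

    /* Hypothetical sketch: turning the standalone OpenCL vector-add host
     * program into a library call and registering it with the runtime.
     * All identifiers here (vadd_args, vadd_op, MY_OP_VADD, my_plugin_ops)
     * are illustrative, not the real vAccel plugin API. */
    #include <stddef.h>

    /* Step 1: the 'libified' application. For illustration, a plain CPU
     * loop stands in for the real OpenCL/FPGA path (bitstream setup,
     * data transfer, kernel invocation, read-back). */
    static int vadd_fpga(const float *a, const float *b, float *c, size_t n)
    {
            for (size_t i = 0; i < n; i++)
                    c[i] = a[i] + b[i];
            return 0;
    }

    /* Step 2: glue code that unpacks a generic request into that call. */
    struct vadd_args {
            const float *a, *b;
            float *c;
            size_t n;
    };

    static int vadd_op(void *args)
    {
            struct vadd_args *v = args;
            return vadd_fpga(v->a, v->b, v->c, v->n);
    }

    /* Step 3: advertise the operation to the runtime, so that a new
     * vector-add call added to vAccelRT can be dispatched here. */
    struct { int op; int (*func)(void *); } my_plugin_ops[] = {
            { /* MY_OP_VADD */ 1, vadd_op },
    };

An application in the guest would then call the new vAccelRT API with the two input arrays and get the result back, without knowing whether the host-side plugin runs the kernel on a CPU or on an FPGA.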
In the meantime, I will also try to show you what the application looks like, as [18:35.760 --> 18:46.200] soon as it finishes. And it should finish right now, almost. Okay. And that's the application. [18:46.200 --> 18:52.520] So as you can see, yeah, we can skip the reading of the file. So this application is quite [18:52.520 --> 18:57.800] simple. We have a session that we have to create with vAccel, with the host. Then we [18:57.800 --> 19:06.400] simply call this function, which is the vAccel image classification. It takes [19:06.400 --> 19:15.800] the arguments that are needed. And then we simply release the resources that we have [19:15.800 --> 19:25.600] used. So I will try to do an image classification of this beautiful hedgehog that we have [19:25.600 --> 19:35.480] here. And let's see what's going to happen. Okay. So all these logs that you see here [19:35.480 --> 19:48.200] are from the jetson-inference plugin. And we see that we have a hedgehog. So it was identified. [19:48.200 --> 19:55.200] And the thing to note here is that all of these logs are not from Unikraft. [19:55.200 --> 20:04.040] All of these logs are from the host that is running the plugin. I can also show you this small demo [20:04.040 --> 20:17.680] with some operations on arrays using CUDA. So the same here. We're just going [20:17.680 --> 20:23.440] to export the backend. First, we're going to use a noop plugin, which simply doesn't [20:23.440 --> 20:32.640] do anything; you would mostly use it only for debugging. So we have here the application, [20:32.640 --> 20:40.240] which is an sgemm. And you can see that it doesn't do anything because it's just the noop plugin. [20:40.240 --> 20:47.120] It doesn't do anything special. So we can change the configuration on the host and specify [20:47.120 --> 20:58.760] that the backend that we want to use is the actual CUDA implementation, for the CPU maybe. Yes. [20:58.760 --> 21:05.800] Okay. So then we will run it and you will see that we have... actually, it's a min-max [21:05.800 --> 21:13.600] operation, it's not sgemm. And then we will also run the same thing [21:13.600 --> 21:21.560] on a GPU. Again, we are just on the host again. We can simply change the configuration, [21:21.560 --> 21:31.480] and now we start the unikernel again, and we get the result from the GPU. All [21:31.480 --> 21:44.400] these debug messages, you can remove them, of course. So we also have, yes, [21:44.400 --> 21:53.400] this is still min-max. Now we will go to sgemm. Do we still have time? Yeah. Okay. [21:53.400 --> 22:02.320] So yeah, we can just use this. Again, noop, nothing happens. Nothing really special. We [22:02.320 --> 22:10.240] will do the export to specify the CPU plugin again. And we will execute, and we'll [22:10.240 --> 22:21.280] see the execution time; it's not very big, but just remember that number. [22:21.280 --> 22:28.360] And now we will run it on the GPU, and you can see here that the execution time is much [22:28.360 --> 22:48.600] better than before. And that's all. We can also show the FPGA one, which is, okay. [22:48.600 --> 22:55.640] So this is an FPGA, right? So we need to have a bitstream. And this is a Black-Scholes application, [22:55.640 --> 23:01.200] by the way. And we will run it natively in the beginning and then we will also run it [23:01.200 --> 23:10.440] in Unikraft.
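Before moving on to the FPGA run: the backend switching in the CUDA demos above is purely host-side configuration, roughly along the following lines. The environment variable and the plugin library names here are assumptions for illustration (the exact names are in the vAccel docs); the point is that the unikernel and its binary never change.

    # Hypothetical host-side plugin selection for the demos above;
    # variable and library names are illustrative, not the exact ones.

    # 1. Dry run with the no-op plugin (useful mostly for debugging)
    export VACCEL_BACKENDS=/usr/local/lib/libvaccel-noop.so

    # 2. Same unikernel, same binary, now against the CPU implementation
    export VACCEL_BACKENDS=/usr/local/lib/libvaccel-cuda-cpu.so

    # 3. ...and against the GPU implementation
    export VACCEL_BACKENDS=/usr/local/lib/libvaccel-cuda-gpu.so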
So first we just run the application natively, and you can see all of [23:10.440 --> 23:20.200] the logs and everything of the execution on the FPGA. And then we will see how [23:20.200 --> 23:33.560] this is executed in a unikernel. So this is... I forgot to show that, but I [23:33.560 --> 23:37.840] will explain later what all of these things are. Usually what we have to do is just [23:37.840 --> 23:44.000] to export the vAccel backend that we want to use. That's how we configure the host to use [23:44.000 --> 23:49.240] a specific plugin. And then we have the QEMU command, which I can explain in more detail [23:49.240 --> 23:58.800] after this video. So, this is from the unikernel now, and we access the FPGA and we have the [23:58.800 --> 24:06.880] Black-Scholes operation running there. And we also have one more FPGA application, but [24:06.880 --> 24:15.200] I think you get the point. We have all the links for the videos and everything in our [24:15.200 --> 24:25.120] FOSDEM talk. So you can also see them from there. Let me talk a bit about QEMU, the [24:25.120 --> 24:32.160] QEMU plugin that we have; this is just from our repo. So here we need [24:32.160 --> 24:43.040] the QEMU which has the virtio backend for vAccel. And if Unikraft, for example, had support [24:43.040 --> 24:48.040] for vsock, we wouldn't have to use the virtio backend, so we wouldn't have to modify QEMU. [24:48.040 --> 24:58.320] But since there is no vsock support, we have to use virtio, and therefore we changed [24:58.320 --> 25:08.040] QEMU a bit, adding the backend, as you can see here. And these are, as you already [25:08.040 --> 25:15.600] know from the previous talk, all the configurations for Unikraft, the command line options. I [25:15.600 --> 25:27.240] will also show you our docs. We have extended documentation here. You can find how [25:27.240 --> 25:35.160] to run a vAccel application in a VM, how to run it remotely. We also have... it doesn't show [25:35.160 --> 25:50.640] here, but we also have... Okay. Maybe more. Okay, so here we also have all the things [25:50.640 --> 25:59.000] that you need to do to try it out by yourself on Unikraft. And all of them are open source. [25:59.000 --> 26:09.040] You can check them out, and you can clone them yourself. So let me return. So currently [26:09.040 --> 26:18.960] vAccel has bindings for... We actually released version 0.5, and currently [26:18.960 --> 26:29.200] we have language bindings for C, C++, Python, Rust, and also for TensorFlow. And we have [26:29.200 --> 26:37.600] the plugin API that I talked about before for extending vAccel. You can also see how it looks. [26:37.600 --> 26:43.720] These are all the things that we have tested and support right now. So from the hypervisor [26:43.720 --> 26:50.760] perspective, we have support for QEMU over virtio and vsock, and for the new Rust VMMs, [26:50.760 --> 27:02.240] like Firecracker, Cloud Hypervisor, and Dragonball. Regarding unikernels, [27:02.240 --> 27:07.880] it's currently working in Unikraft and in Rumprun, but we also want to port it to OSv [27:07.880 --> 27:14.480] and maybe some more unikernel frameworks. And we also have integration with Kubernetes, [27:14.480 --> 27:23.000] Kata Containers, and OpenFaaS for serverless deployments. And these are all the acceleration [27:23.000 --> 27:28.080] frameworks that we have tested and that work with vAccel.
So this is jetson-inference, which you [27:28.080 --> 27:34.920] saw when we did the image classification. We have TensorFlow and PyTorch support, TensorFlow [27:34.920 --> 27:42.280] Lite, OpenVINO, OpenCL, and CUDA, which you saw in the other demo. And regarding hardware, [27:42.280 --> 27:55.160] we have tested with GPUs, edge devices like Coral, and also FPGAs. So to sum up: the software [27:55.160 --> 28:01.440] stacks of hardware accelerators are huge and too complicated [28:01.440 --> 28:10.880] to be ported easily to unikernels. And we have vAccel, which is able to abstract the heterogeneity [28:10.880 --> 28:17.720] both in the hardware and in the software. And it sounds like a perfect fit for unikernels. [28:17.720 --> 28:23.920] So if you want, you can try it out by yourselves. Here are all the links that you can use to [28:23.920 --> 28:33.200] test it out. And we would like to mention that this work is partially funded by two [28:33.200 --> 28:40.120] Horizon projects, SERRANO and 5G-COMPLETE. And we would also like to invite you to the [28:40.120 --> 28:46.520] Unikraft hackathon that will take place in Athens at the end of March. And thank you [28:46.520 --> 28:56.800] for your attention. If you have any questions, we will be happy to answer them.
[28:56.800 --> 29:00.360] Thank you so much, Babis. So for the third time, we'll welcome you to Athens in late March [29:00.360 --> 29:10.120] for the hackathon. Are there any questions from the audience? Yeah, please. Thank you.
[29:10.120 --> 29:18.800] Great stuff. I have a question about the potential future and the performance that we are currently [29:18.800 --> 29:24.880] possibly losing to the use of the API and the transport. What do you think is the potential [29:24.880 --> 29:30.040] for further increases in performance given that framework?
[29:30.040 --> 29:36.160] Yeah, actually, the transport is, yes, a bottleneck, since you have all these [29:36.160 --> 29:49.040] transfers that take place. But we think that in the end we will still have very good execution [29:49.040 --> 29:57.440] times, very good performance. And it's also important to mention that we can also set [29:57.440 --> 30:02.200] up the environment and everything so you can minimize the transfers. For example, you can [30:02.200 --> 30:08.600] have your model. If you have a TensorFlow model or anything, we are working on how it [30:08.600 --> 30:13.720] can be prefetched before you deploy the function on the host, so you have [30:13.720 --> 30:18.120] everything there and you don't have to transfer from the VM to the host and vice versa and [30:18.120 --> 30:23.480] all of these things. Actually, if I may intervene, these are [30:23.480 --> 30:30.360] two issues. The first issue is all the resources, the models, the out-of-band stuff that you [30:30.360 --> 30:39.360] can handle with a separate API, in a cloud environment, in a serverless deployment. And the second [30:39.360 --> 30:46.640] thing, about the actual transfers over virtio or vsock: the thing is that since we semantically [30:46.640 --> 30:53.240] abstract the whole operation, you don't have to do a cudaMemcpy, cudaMalloc, cuda-something, [30:53.240 --> 30:59.680] set kernel, whatever, and you don't have this latency in the transfer. So it minimizes [30:59.680 --> 31:06.280] the overhead to just the part of copying the data across, so the actual data, the input [31:06.280 --> 31:13.920] data and the output. So this is really, really minimal.
So in the VMs that we have tested... we [31:13.920 --> 31:19.360] have tested remotely too, but the network is not that good, so we need to do more tests there. [31:19.360 --> 31:26.160] But in the VMs that we have tested, the overhead is less than 5%, for an image classification [31:26.160 --> 31:34.800] of 32K to a meg, something like that. So it's really, really small, the overhead for the [31:34.800 --> 31:40.240] transport layer, both virtio and vsock. The vsock part is a bit more because it serializes [31:40.240 --> 31:45.800] the stuff through protobufs and the stack is a bit complicated, but the virtio path [31:45.800 --> 31:49.240] is really super efficient.
[31:49.240 --> 31:58.560] Hi, so thank you for the talk. My question is almost on the same thing, [31:58.560 --> 32:03.920] but from the security perspective. So if we offload a lot of computation out of [32:03.920 --> 32:11.560] the unikernel to the host again, I guess security, or at least isolation, is a thing [32:11.560 --> 32:15.480] to think about. So, any words on this topic?
[32:15.480 --> 32:20.360] Yeah, you can take it. It's yours.
[32:20.360 --> 32:30.360] Okay, we agree. Yes, there are issues with security, because essentially you run [32:30.360 --> 32:38.560] a unikernel to be isolated, and now we push the execution to the host. So one of the things [32:38.560 --> 32:45.560] that we have thought about is that when you run that in a cloud environment, the vendor [32:45.560 --> 32:52.760] should make sure that whatever application is supported to run on the host is [32:52.760 --> 32:59.240] secure and audited. So the user doesn't have all the possibilities available. [32:59.240 --> 33:04.680] They cannot just exec something on the host. They will only be able to exec specific things that [33:04.680 --> 33:14.280] are audited, in libraries in the plugin system. So one approach is this. Another response [33:14.280 --> 33:24.320] to the security implications is that at the moment you have no way to run [33:24.320 --> 33:33.360] a hardware-accelerated workload from a unikernel at all. So if you want to be able to deploy such an [33:33.360 --> 33:46.000] application somewhere, then you can run isolated. You can use the whole hardware accelerator [33:46.000 --> 33:51.640] and have the same binary that you would deploy in a non-secure environment. So you could [33:51.640 --> 34:02.440] secure the environment, but have this compatibility and software supply mode using a unikernel, [34:02.440 --> 34:09.440] using this semantic abstraction, let's say.
[34:09.440 --> 34:16.440] Any other question? Yeah. Please.
[34:16.440 --> 34:22.440] So my question is similar to the first question, but I'm wondering, because you can also do [34:22.440 --> 34:33.600] GPU pass-through via KVM and just pass the GPU to a virtual machine. So I'm wondering, [34:33.600 --> 34:39.920] what is the performance difference between doing that and doing it with vAccel?
[34:39.920 --> 34:45.000] Yes. Actually, we want to evaluate that, and we need to evaluate it and see how, for example, [34:45.000 --> 34:50.560] even direct pass-through, like exposing the whole GPU to the VM, compares; this could [34:50.560 --> 35:00.640] also be one baseline for the evaluation. Currently, I don't remember if we have any measurements [35:00.640 --> 35:01.640] already.
[35:01.640 --> 35:05.560] Would you consider the pass-through case the same as native?
[35:05.560 --> 35:23.800] Yeah, but I mean, if we have any, like, okay.
Actually, with GPU virtualization, for example, [35:23.800 --> 35:31.720] I'm not sure how many VMs can be supported on one single GPU. I'm not aware [35:31.720 --> 35:42.320] of any solution that can scale to, like, tens of VMs, even tens of VMs. I'm not sure if [35:42.320 --> 35:50.360] there is any existing solution for that. But, yes, we plan it. We want to do some extended [35:50.360 --> 35:59.280] evaluation, compared also to, let's say, the virtual GPU solutions that exist, or even pass-through [35:59.280 --> 36:07.080] and native execution. We want to do that, and hopefully we can also publish the results [36:07.080 --> 36:09.080] on our blog.
[36:09.080 --> 36:11.600] Okay. Thank you. [36:11.600 --> 36:14.720] Any other questions? Yeah.
[36:14.720 --> 36:26.160] So, in response to the first security question: yeah, we are now offloading compute [36:26.160 --> 36:32.520] to the hypervisor and host. So, does it imply that there is a possibility to break out of [36:32.520 --> 36:36.480] the containerization with vAccel?
[36:36.480 --> 36:51.280] Well, yes, yes, code is going to be executing on the host at a privileged level. [36:51.280 --> 37:04.880] Yes. But the other option is what? So, yeah, there is a trade-off. [37:04.880 --> 37:10.320] We are actually working on it. We want to see what options are available there. How can we [37:10.320 --> 37:16.320] make it more secure? How can we sandbox it somehow to make it look better? But on the [37:16.320 --> 37:22.840] other hand, for example, in FPGAs there's no MMU, there's nothing. If you run two kernels, [37:22.840 --> 37:27.200] if you kind of know what to do, one kernel can access all the [37:27.200 --> 37:31.640] memory in the whole FPGA, for example. So, on the one hand, you also need support from the [37:31.640 --> 37:37.560] hardware. And regarding, for example, the software stack, we are looking at it to see [37:37.560 --> 37:49.880] how we can extend it and, at least, increase the difficulty of [37:49.880 --> 37:50.880] having any such breakout. [37:50.880 --> 37:59.840] So, for example, in the Kata Containers integration that we have, when you spawn a container, [37:59.840 --> 38:09.200] you sandbox the container in a VM; our agent, the host part of vAccel, is running in the [38:09.200 --> 38:17.160] same sandbox, not in the VM, outside the VM, but it runs in the sandbox. So, yes, there [38:17.160 --> 38:27.360] is code executing on the host, but it's in the sandbox. So, it's kind of a trade-off.
[38:27.360 --> 38:35.440] Anything else? Right? If not, thank you, Anastasios. Thank you, Babis. Yeah.