[00:00.000 --> 00:13.000] The next session will be by Damien, about SPDK with Xen, so enjoy the session.
[00:13.000 --> 00:16.000] My name is Damien.
[00:16.000 --> 00:22.000] I'm doing this work in collaboration with a French university,
[00:22.000 --> 00:32.000] and I work at Vates, which is a software company based in Grenoble.
[00:32.000 --> 00:38.000] So Vates is the main company behind the XCP-ng project,
[00:38.000 --> 00:55.000] and we want to do better with what we have, because of some problems in our current approach.
[00:55.000 --> 01:01.000] So my focus is on the storage virtualization part of the hypervisor,
[01:01.000 --> 01:09.000] and we currently have a real performance problem with the storage on the platform.
[01:09.000 --> 01:16.000] As you can see on the left, we have bare metal NVMe storage,
[01:16.000 --> 01:19.000] and when we add the virtualization stack, from inside the VM
[01:19.000 --> 01:26.000] we don't get 100% of the performance of the NVMe.
[01:26.000 --> 01:32.000] So what we want to do is keep the security
[01:32.000 --> 01:37.000] given by the hypervisor and by the Xen storage stack.
[01:37.000 --> 01:50.000] We want to have as little impact as possible on the user.
[01:50.000 --> 01:53.000] So we want to minimize the impact on users:
[01:53.000 --> 01:57.000] we don't want them to have to change too many things in the virtual machine
[01:57.000 --> 02:00.000] or too many things on the platform side.
[02:00.000 --> 02:05.000] So what I'm proposing is to use...
[02:05.000 --> 02:09.000] Well, first I want to introduce the context with Xen.
[02:09.000 --> 02:11.000] Xen is a type 1 hypervisor,
[02:11.000 --> 02:18.000] meaning it is the software that runs first on the hardware.
[02:18.000 --> 02:24.000] And to be able to use things like the storage and the network,
[02:24.000 --> 02:28.000] it relies on something which is nothing special:
[02:28.000 --> 02:32.000] a virtual machine, running Linux.
[02:32.000 --> 02:35.000] It is the one that runs second:
[02:35.000 --> 02:41.000] Xen initializes the CPUs, then runs this virtual machine.
[02:41.000 --> 02:47.000] It is given control of the storage and the network.
[02:47.000 --> 02:49.000] So this VM is very important,
[02:49.000 --> 02:58.000] because it has the responsibility of sharing the network and the storage with the other VMs.
[02:58.000 --> 03:02.000] So what I'm proposing is...
[03:02.000 --> 03:05.000] I'm focusing on NVMe.
[03:05.000 --> 03:08.000] Well, in this case, for example, it's an NVMe device,
[03:08.000 --> 03:12.000] and we want to focus on the NVMe part.
[03:12.000 --> 03:17.000] So NVMe is a newer storage protocol.
[03:17.000 --> 03:20.000] It's been around for a few years. It's everywhere now.
[03:20.000 --> 03:27.000] And it gives us much better performance.
[03:27.000 --> 03:34.000] So I have to introduce a few Xen concepts.
[03:34.000 --> 03:37.000] So we have the dom0 virtual machine,
[03:37.000 --> 03:44.000] which shares the storage with the other virtual machines.
[03:44.000 --> 03:48.000] And to do that, we have what we call PV drivers.
[03:48.000 --> 03:54.000] So we have a specialized protocol, blkif, for the storage case,
[03:54.000 --> 04:02.000] that is used to transmit block storage requests from the VM to our backend running in dom0.
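To make the blkif split-driver model more concrete, here is a minimal sketch of what a dom0 backend loop could look like when it consumes requests from the shared ring, using the macros from Xen's public io/ring.h and io/blkif.h headers. The handle_read(), handle_write() and notify_frontend() helpers are hypothetical placeholders for the actual I/O submission and event-channel kick, and a real backend completes I/O asynchronously rather than responding inline as done here.

    #include <xen/io/ring.h>
    #include <xen/io/blkif.h>

    /* Hypothetical stand-ins for the real I/O path and the event-channel kick. */
    static int16_t handle_read(const struct blkif_request *req);
    static int16_t handle_write(const struct blkif_request *req);
    static void    notify_frontend(void);

    /* Consume guest requests from the shared ring and post responses. */
    static void consume_ring(blkif_back_ring_t *ring)
    {
        while (RING_HAS_UNCONSUMED_REQUESTS(ring)) {
            /* Fetch the next request published by the guest's frontend. */
            struct blkif_request *req = RING_GET_REQUEST(ring, ring->req_cons);
            ring->req_cons++;

            int16_t status;
            switch (req->operation) {
            case BLKIF_OP_READ:
                status = handle_read(req);
                break;
            case BLKIF_OP_WRITE:
                status = handle_write(req);
                break;
            default:
                status = BLKIF_RSP_EOPNOTSUPP;
            }

            /* Post a response so the frontend can complete the I/O. */
            struct blkif_response *rsp = RING_GET_RESPONSE(ring, ring->rsp_prod_pvt);
            rsp->id        = req->id;
            rsp->operation = req->operation;
            rsp->status    = status;
            ring->rsp_prod_pvt++;

            int notify;
            RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(ring, notify);
            if (notify)
                notify_frontend();
        }
    }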
[04:02.000 --> 04:09.000] So most of the time the frontend is, for example, the blkfront driver in Linux,
[04:09.000 --> 04:15.000] which presents itself as a block device in the virtual machine.
[04:15.000 --> 04:19.000] It has already been in place for a long time,
[04:19.000 --> 04:23.000] it's upstream in the Linux kernel,
[04:23.000 --> 04:26.000] and it's been running like this since 2015 as well.
[04:26.000 --> 04:29.000] And the backend, currently in XCP-ng,
[04:29.000 --> 04:34.000] is tapdisk, which is a userspace program
[04:34.000 --> 04:37.000] that takes the blkif requests
[04:37.000 --> 04:44.000] and uses libaio to transfer them to the dom0 block layer.
[04:44.000 --> 04:49.000] In Xen, we have a special interface to share memory between VMs,
[04:49.000 --> 04:53.000] which is used in the blkif protocol
[04:53.000 --> 05:00.000] to move block requests from the virtual machines to our dom0.
[05:00.000 --> 05:05.000] This interface is mediated by the hypervisor:
[05:05.000 --> 05:07.000] it's the grant table.
[05:07.000 --> 05:15.000] We have the virtual machine, which we usually call the guest VM.
[05:15.000 --> 05:23.000] It basically tells the hypervisor that access to some of its memory will be granted to dom0,
[05:23.000 --> 05:27.000] even though that memory belongs to the guest.
[05:27.000 --> 05:33.000] And dom0 needs to ask the hypervisor to map that guest memory into its own memory
[05:33.000 --> 05:35.000] to be able to access it.
[05:35.000 --> 05:38.000] So I have replaced tapdisk in our implementation,
[05:38.000 --> 05:42.000] because we want to use SPDK in place of tapdisk,
[05:42.000 --> 05:45.000] and we want to directly take the requests
[05:45.000 --> 05:48.000] and transmit them to the NVMe.
[05:48.000 --> 05:52.000] So, to tell you a bit more about SPDK:
[05:52.000 --> 05:55.000] SPDK is the Storage Performance Development Kit.
[05:55.000 --> 05:57.000] It was originally created by Intel,
[05:57.000 --> 06:01.000] and it's used by storage device vendors.
[06:01.000 --> 06:05.000] It is essentially a driver for NVMe devices
[06:05.000 --> 06:12.000] that runs in user space, in our case in dom0,
[06:12.000 --> 06:16.000] and it's used on Linux, on bare metal as well.
[06:16.000 --> 06:20.000] It's part of the same family of projects as DPDK,
[06:20.000 --> 06:22.000] but it's a separate project.
[06:22.000 --> 06:29.000] [inaudible exchange]
[06:29.000 --> 06:36.000] So here we have the current state.
[06:59.000 --> 07:25.000] So we have the block layer being traversed two times for one request:
[07:25.000 --> 07:33.000] it's one of the costs that adds to the difference we have with bare metal.
[07:33.000 --> 07:39.000] So our proposal is to use SPDK to directly transmit blkif requests
[07:39.000 --> 07:46.000] from a virtual machine to a storage device,
[07:46.000 --> 07:56.000] and to reduce the cost by bypassing the kernel in dom0.
[07:56.000 --> 07:59.000] It's completely transparent for the virtual machine
[07:59.000 --> 08:06.000] because we reuse a lot of the infrastructure already present.
[08:06.000 --> 08:16.000] So, to take a blkif request: it's a simple structure in shared memory,
[08:16.000 --> 08:21.000] in a ring; it's not very different from virtio in this respect.
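For reference, the request structure that travels over that shared ring looks roughly like the following. It is trimmed down from Xen's public io/blkif.h header (the real header has more operation types and separate 32-bit/64-bit layout variants); the typedefs are inlined here so the snippet stands on its own.

    #include <stdint.h>

    typedef uint32_t grant_ref_t;      /* grant reference to a page of guest memory */
    typedef uint16_t blkif_vdev_t;     /* virtual disk handle                        */
    typedef uint64_t blkif_sector_t;   /* sector number, in 512-byte sectors         */

    #define BLKIF_OP_READ                  0
    #define BLKIF_OP_WRITE                 1
    #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11

    struct blkif_request_segment {
        grant_ref_t gref;        /* grant reference for one page of the I/O buffer */
        uint8_t     first_sect;  /* first sector used within that page             */
        uint8_t     last_sect;   /* last sector used within that page              */
    };

    struct blkif_request {
        uint8_t        operation;     /* BLKIF_OP_READ, BLKIF_OP_WRITE, ...        */
        uint8_t        nr_segments;   /* number of segments in seg[]               */
        blkif_vdev_t   handle;        /* which virtual disk the request targets    */
        uint64_t       id;            /* guest-chosen id, echoed in the response   */
        blkif_sector_t sector_number; /* start sector on the virtual disk          */
        struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
    };

The grant references in each segment are what dom0 has to map or copy through the grant table before it can touch the guest's data.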
[08:21.000 --> 08:27.000] And we have multiple request types: read, write, discard, and so on.
[08:27.000 --> 08:32.000] I'm going to focus on read and write, because it's the minimum we need to implement
[08:32.000 --> 08:36.000] to be able to handle requests.
[08:36.000 --> 08:42.000] And we just have to transfer them to an SPDK interface.
[08:42.000 --> 08:49.000] Now, SPDK needs to use special memory to be able to transmit data to the device.
[08:49.000 --> 08:54.000] So we need to use the SPDK memory allocator
[08:54.000 --> 09:00.000] to get a buffer that can be used to go to and from the device.
[09:00.000 --> 09:08.000] So we need to have the data copied from the virtual machine into our dom0,
[09:08.000 --> 09:11.000] then we can transfer it to the disk.
[09:11.000 --> 09:13.000] So it's pretty simple: we allocate memory,
[09:13.000 --> 09:19.000] we copy the data into that memory using the grant table interface,
[09:19.000 --> 09:21.000] then we just write it to the disk.
[09:21.000 --> 09:26.000] SPDK will call a callback that we've given it when it has finished,
[09:26.000 --> 09:31.000] and then we can do the same, in reverse, for the read requests.
[09:31.000 --> 09:37.000] So it's working well for now.
[09:37.000 --> 09:42.000] As you can see, for the read requests, the first and the second columns are not very good;
[09:42.000 --> 09:45.000] that's because the implementation is not finished.
[09:45.000 --> 09:53.000] In this case, I'm doing more grant calls to the hypervisor than on the write requests,
[09:53.000 --> 09:57.000] and that's a big cost in our implementation.
[09:57.000 --> 10:04.000] But for now that's how it's done, and we'll look into improving it.
[10:04.000 --> 10:06.000] But we are doing better than tapdisk:
[10:06.000 --> 10:07.000] that's on the right.
[10:07.000 --> 10:11.000] The blue columns are the current state with tapdisk,
[10:11.000 --> 10:16.000] and our implementation is the red one, on the left.
[10:16.000 --> 10:21.000] It's the same for block size and throughput.
[10:21.000 --> 10:33.000] So we are able to improve the performance of our storage stack in a transparent manner for the VMs,
[10:33.000 --> 10:41.000] because you can take a normal VM that boots today on tapdisk in the current infrastructure of the storage stack
[10:41.000 --> 10:43.000] and still make it work.
[10:43.000 --> 10:49.000] The problem is that we have to dedicate an NVMe drive to the SPDK platform.
[10:49.000 --> 10:56.000] But NVMe is pretty much everywhere nowadays, even in data centers, especially in data centers.
[10:56.000 --> 11:04.000] We are still able to use the security of the grant table, because we keep the model where the VM only shares
[11:04.000 --> 11:11.000] the data that needs to be written to the disk with the backend in SPDK,
[11:11.000 --> 11:18.000] and Xen is still the mediator for this.
[11:18.000 --> 11:26.000] What we want to do next is, of course, to have the read requests do better than tapdisk,
[11:26.000 --> 11:33.000] and since we are in some cases, for example here, already at the same level without this optimization,
[11:33.000 --> 11:35.000] I'm not very worried about that.
[11:35.000 --> 11:44.000] But we also want to be able to not have to copy data from the VM into dom0,
[11:44.000 --> 11:52.000] and instead have it handled by the NVMe directly, using the guest memory as source and destination
[11:52.000 --> 11:56.000] for the DMA requests from the NVMe drive.
[11:56.000 --> 12:06.000] And we want to take a look at the grant table interface to see if it can be improved for modern-day computing.
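As a rough illustration of the write path just described (allocate a DMA-safe buffer with the SPDK allocator, grant-copy the guest data into it, submit it to the NVMe drive, and get a completion callback), here is a minimal sketch against the SPDK NVMe driver API. It assumes an already initialized namespace and I/O queue pair, and reuses the blkif_request layout from the earlier snippet; copy_in_from_guest() and push_blkif_response() are hypothetical helpers standing in for the grant-table copy and the blkif response handling, and the sizing is simplified to whole 4 KiB segments.

    #include <errno.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include "spdk/env.h"
    #include "spdk/nvme.h"

    /* Hypothetical helpers wrapping the Xen grant copy and the blkif response path. */
    static void copy_in_from_guest(void *dst, const struct blkif_request *req);
    static void push_blkif_response(uint64_t id, int16_t status);

    /* Hypothetical per-request context carried through the completion callback. */
    struct pending_write {
        void    *dma_buf;
        uint64_t req_id;
    };

    /* Completion callback: SPDK calls this when the NVMe write has finished. */
    static void write_done(void *cb_arg, const struct spdk_nvme_cpl *cpl)
    {
        struct pending_write *pw = cb_arg;
        push_blkif_response(pw->req_id, spdk_nvme_cpl_is_error(cpl) ? -1 : 0);
        spdk_dma_free(pw->dma_buf);
        free(pw);
    }

    static int submit_guest_write(struct spdk_nvme_ns *ns,
                                  struct spdk_nvme_qpair *qpair,
                                  const struct blkif_request *req)
    {
        /* Simplified sizing: assume each segment covers a full 4 KiB page. */
        size_t len = (size_t)req->nr_segments * 4096;

        /* SPDK needs DMA-safe (hugepage-backed) memory, so use its allocator
         * instead of malloc(). */
        void *dma_buf = spdk_dma_zmalloc(len, 4096, NULL);
        if (dma_buf == NULL)
            return -ENOMEM;

        /* Copy the data out of the guest's granted pages into our buffer. */
        copy_in_from_guest(dma_buf, req);

        struct pending_write *pw = malloc(sizeof(*pw));
        pw->dma_buf = dma_buf;
        pw->req_id  = req->id;

        /* blkif sectors are 512 bytes; convert to the namespace's LBA size. */
        uint32_t lba_size = spdk_nvme_ns_get_sector_size(ns);
        uint64_t lba      = (req->sector_number * 512ULL) / lba_size;

        /* Hand the buffer to the NVMe drive; write_done() fires once the
         * completion is reaped by spdk_nvme_qpair_process_completions(). */
        return spdk_nvme_ns_cmd_write(ns, qpair, dma_buf, lba,
                                      (uint32_t)(len / lba_size),
                                      write_done, pw, 0);
    }

Read requests go the other way around: submit spdk_nvme_ns_cmd_read() into a DMA-safe buffer first, then grant-copy the result back into the guest's pages from the completion callback.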
[12:06.000 --> 12:09.000] So, I'm finished.
[12:09.000 --> 12:18.000] Thank you so much.
[12:18.000 --> 12:19.000] Questions?
[12:19.000 --> 12:21.000] Yes.
[12:21.000 --> 12:27.000] I know from operating an OpenStack cloud that DPDK is very hard to install and implement.
[12:27.000 --> 12:30.000] How hard is it to implement SPDK?
[12:30.000 --> 12:39.000] And the other question, very quickly: can your work somehow be applied to KVM virtualization?
[12:39.000 --> 12:48.000] So the question is whether SPDK is hard to use, like DPDK, and whether it can be used in the KVM infrastructure.
[12:48.000 --> 12:55.000] So it is already being used in the KVM infrastructure, as a storage platform for virtio guests.
[12:55.000 --> 13:00.000] It's just that our case is special because of the different architecture between KVM and Xen.
[13:00.000 --> 13:05.000] So that's already done by the SPDK community.
[13:05.000 --> 13:08.000] I would say that SPDK is not very hard to install.
[13:08.000 --> 13:15.000] In our case, it would be shipped with the hypervisor and the XCP-ng installation,
[13:15.000 --> 13:18.000] so it's not very hard to install.
[13:18.000 --> 13:23.000] We just have to have a special configuration for our dom0,
[13:23.000 --> 13:34.000] because SPDK relies on 2 MB superpages, hugepages, to be able to do DMA requests.
[13:34.000 --> 13:48.000] So we have to have this support, and it's not available in the basic configuration of dom0.
[13:48.000 --> 13:49.000] Yes?
[13:49.000 --> 13:55.000] Does your implementation survive a crash of the SPDK process in dom0?
[13:55.000 --> 13:57.000] Well, yes.
[13:57.000 --> 14:03.000] The question is whether the implementation survives a crash of the SPDK process.
[14:03.000 --> 14:08.000] So the virtual machine would be able to survive SPDK not being available.
[14:08.000 --> 14:13.000] We would lose the disk in the virtual machine;
[14:13.000 --> 14:19.000] well, the disk would hang in the virtual machine,
[14:19.000 --> 14:30.000] but the virtual machine would still be able to run despite the problem.
[14:30.000 --> 14:36.000] Any other questions?
[14:36.000 --> 14:38.000] Thank you.
[14:38.000 --> 14:43.000] Thank you again.