[00:00.000 --> 00:13.000] The next session will be by Damien, about SPDK with Xen, so enjoy the session.
[00:13.000 --> 00:16.000] My name is Damien.
[00:16.000 --> 00:22.000] I'm doing this work in collaboration with a French university,
[00:22.000 --> 00:32.000] and I work at Vates, which is a software company based in Grenoble.
[00:32.000 --> 00:38.000] So Vates is the main company behind the XCP-ng project,
[00:38.000 --> 00:55.000] and we want to do better with what we have, because of some problems in our current approach.
[00:55.000 --> 01:01.000] So my focus is on the storage virtualization part of the hypervisor,
[01:01.000 --> 01:09.000] and we currently have a real performance problem with the storage on the platform.
[01:09.000 --> 01:16.000] As you can see on the left, we have bare metal NVMe storage,
[01:16.000 --> 01:19.000] and when we add the virtualization stack, from inside the VM
[01:19.000 --> 01:26.000] we don't get 100% of the performance of the NVMe.
[01:26.000 --> 01:32.000] So what we want to do is keep the security
[01:32.000 --> 01:37.000] given by the hypervisor and by the Xen storage stack.
[01:37.000 --> 01:50.000] We want to have as little impact as possible on the user.
[01:50.000 --> 01:53.000] So we want to minimize the impact on users:
[01:53.000 --> 01:57.000] we don't want them to have to change too many things in the virtual machine
[01:57.000 --> 02:00.000] or too many things on the platform side.
[02:00.000 --> 02:05.000] So what I'm proposing is to use...
[02:05.000 --> 02:09.000] Well, first I want to introduce the context with Xen.
[02:09.000 --> 02:11.000] Xen is a type 1 hypervisor,
[02:11.000 --> 02:18.000] meaning it is the software that runs first on the hardware.
[02:18.000 --> 02:24.000] And to be able to use things like the storage and the network,
[02:24.000 --> 02:28.000] it relies on something which is nothing special:
[02:28.000 --> 02:32.000] a virtual machine, running Linux.
[02:32.000 --> 02:35.000] It is the one that runs second:
[02:35.000 --> 02:41.000] Xen initializes the CPUs, then runs this virtual machine.
[02:41.000 --> 02:47.000] It is given control of the storage and the network.
[02:47.000 --> 02:49.000] So this VM is very important,
[02:49.000 --> 02:58.000] because it has the responsibility of sharing the network and the storage with the other VMs.
[02:58.000 --> 03:02.000] So what I'm proposing is...
[03:02.000 --> 03:05.000] I'm focusing on NVMe.
[03:05.000 --> 03:08.000] Well, in this case, for example, it's an NVMe device,
[03:08.000 --> 03:12.000] and we want to focus on the NVMe part.
[03:12.000 --> 03:17.000] So NVMe is a newer storage protocol.
[03:17.000 --> 03:20.000] It's been around for a few years. It's everywhere now.
[03:20.000 --> 03:27.000] And it gives us much better performance.
[03:27.000 --> 03:34.000] So I have to introduce a few Xen concepts.
[03:34.000 --> 03:37.000] So we have the dom0 virtual machine,
[03:37.000 --> 03:44.000] which shares the storage with the other virtual machines.
[03:44.000 --> 03:48.000] And to do that, we have what we call PV drivers.
[03:48.000 --> 03:54.000] So we have a specialized protocol, blkif, for the storage case,
[03:54.000 --> 04:02.000] that is used to transmit block storage requests from the VM to our backend running in dom0.
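To make the blkif split-driver model more concrete, here is a minimal sketch of what a dom0 backend loop could look like when it consumes requests from the shared ring, using the macros from Xen's public io/ring.h and io/blkif.h headers. The handle_read(), handle_write() and notify_frontend() helpers are hypothetical placeholders for the actual I/O submission and event-channel kick, and a real backend completes I/O asynchronously rather than responding inline as done here.

    #include <xen/io/ring.h>
    #include <xen/io/blkif.h>

    /* Hypothetical stand-ins for the real I/O path and the event-channel kick. */
    static int16_t handle_read(const struct blkif_request *req);
    static int16_t handle_write(const struct blkif_request *req);
    static void    notify_frontend(void);

    /* Consume guest requests from the shared ring and post responses. */
    static void consume_ring(blkif_back_ring_t *ring)
    {
        while (RING_HAS_UNCONSUMED_REQUESTS(ring)) {
            /* Fetch the next request published by the guest's frontend. */
            struct blkif_request *req = RING_GET_REQUEST(ring, ring->req_cons);
            ring->req_cons++;

            int16_t status;
            switch (req->operation) {
            case BLKIF_OP_READ:
                status = handle_read(req);
                break;
            case BLKIF_OP_WRITE:
                status = handle_write(req);
                break;
            default:
                status = BLKIF_RSP_EOPNOTSUPP;
            }

            /* Post a response so the frontend can complete the I/O. */
            struct blkif_response *rsp = RING_GET_RESPONSE(ring, ring->rsp_prod_pvt);
            rsp->id        = req->id;
            rsp->operation = req->operation;
            rsp->status    = status;
            ring->rsp_prod_pvt++;

            int notify;
            RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(ring, notify);
            if (notify)
                notify_frontend();
        }
    }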
[04:02.000 --> 04:09.000] So most of the time the frontend is, for example, the blkfront driver in Linux,
[04:09.000 --> 04:15.000] which presents itself as a block device in the virtual machine.
[04:15.000 --> 04:19.000] It has already been in place for a long time,
[04:19.000 --> 04:23.000] it's upstream in the Linux kernel,
[04:23.000 --> 04:26.000] and it's been running like this since 2015 as well.
[04:26.000 --> 04:29.000] And the backend, currently in XCP-ng,
[04:29.000 --> 04:34.000] is tapdisk, which is a userspace program
[04:34.000 --> 04:37.000] that takes the blkif requests
[04:37.000 --> 04:44.000] and uses libaio to transfer them to the dom0 block layer.
[04:44.000 --> 04:49.000] In Xen, we have a special interface to share memory between VMs,
[04:49.000 --> 04:53.000] which is used in the blkif protocol
[04:53.000 --> 05:00.000] to move block requests from the virtual machines to our dom0.
[05:00.000 --> 05:05.000] This interface is mediated by the hypervisor:
[05:05.000 --> 05:07.000] it's the grant table.
[05:07.000 --> 05:15.000] We have the virtual machine, which we usually call the guest VM.
[05:15.000 --> 05:23.000] It basically tells the hypervisor that access to some of its memory will be granted to dom0,
[05:23.000 --> 05:27.000] even though that memory belongs to the guest.
[05:27.000 --> 05:33.000] And dom0 needs to ask the hypervisor to map that guest memory into its own memory
[05:33.000 --> 05:35.000] to be able to access it.
[05:35.000 --> 05:38.000] So I have replaced tapdisk in our implementation,
[05:38.000 --> 05:42.000] because we want to use SPDK in place of tapdisk,
[05:42.000 --> 05:45.000] and we want to directly take the requests
[05:45.000 --> 05:48.000] and transmit them to the NVMe.
[05:48.000 --> 05:52.000] So, to tell you a bit more about SPDK:
[05:52.000 --> 05:55.000] SPDK is the Storage Performance Development Kit.
[05:55.000 --> 05:57.000] It was originally created by Intel,
[05:57.000 --> 06:01.000] and it's used by storage device vendors.
[06:01.000 --> 06:05.000] It is essentially a driver for NVMe devices
[06:05.000 --> 06:12.000] that runs in user space, in our case in dom0,
[06:12.000 --> 06:16.000] and it's used on Linux, on bare metal as well.
[06:16.000 --> 06:20.000] It's part of the same family of projects as DPDK,
[06:20.000 --> 06:22.000] but it's a separate project.
[06:22.000 --> 06:29.000] [inaudible exchange]
[06:29.000 --> 06:36.000] So here we have the current state.
[06:59.000 --> 07:25.000] So we have the block layer being traversed two times for one request:
[07:25.000 --> 07:33.000] it's one of the costs that adds to the difference we have with bare metal.
[07:33.000 --> 07:39.000] So our proposal is to use SPDK to directly transmit blkif requests
[07:39.000 --> 07:46.000] from a virtual machine to a storage device,
[07:46.000 --> 07:56.000] and to reduce the cost by bypassing the kernel in dom0.
[07:56.000 --> 07:59.000] It's completely transparent for the virtual machine
[07:59.000 --> 08:06.000] because we reuse a lot of the infrastructure already present.
[08:06.000 --> 08:16.000] So, to take a blkif request: it's a simple structure in shared memory,
[08:16.000 --> 08:21.000] in a ring; it's not very different from virtio in this respect.
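For reference, the request structure that travels over that shared ring looks roughly like the following. It is trimmed down from Xen's public io/blkif.h header (the real header has more operation types and separate 32-bit/64-bit layout variants); the typedefs are inlined here so the snippet stands on its own.

    #include <stdint.h>

    typedef uint32_t grant_ref_t;      /* grant reference to a page of guest memory */
    typedef uint16_t blkif_vdev_t;     /* virtual disk handle                        */
    typedef uint64_t blkif_sector_t;   /* sector number, in 512-byte sectors         */

    #define BLKIF_OP_READ                  0
    #define BLKIF_OP_WRITE                 1
    #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11

    struct blkif_request_segment {
        grant_ref_t gref;        /* grant reference for one page of the I/O buffer */
        uint8_t     first_sect;  /* first sector used within that page             */
        uint8_t     last_sect;   /* last sector used within that page              */
    };

    struct blkif_request {
        uint8_t        operation;     /* BLKIF_OP_READ, BLKIF_OP_WRITE, ...        */
        uint8_t        nr_segments;   /* number of segments in seg[]               */
        blkif_vdev_t   handle;        /* which virtual disk the request targets    */
        uint64_t       id;            /* guest-chosen id, echoed in the response   */
        blkif_sector_t sector_number; /* start sector on the virtual disk          */
        struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
    };

The grant references in each segment are what dom0 has to map or copy through the grant table before it can touch the guest's data.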
[08:21.000 --> 08:27.000] And we have multiple request types: read, write, discard, and so on.
[08:27.000 --> 08:32.000] I'm going to focus on read and write, because it's the minimum we need to implement
[08:32.000 --> 08:36.000] to be able to handle requests.
[08:36.000 --> 08:42.000] And we just have to transfer them to an SPDK interface.
[08:42.000 --> 08:49.000] Now, SPDK needs to use special memory to be able to transmit data to the device.
[08:49.000 --> 08:54.000] So we need to use the SPDK memory allocator
[08:54.000 --> 09:00.000] to get a buffer that can be used to go to and from the device.
[09:00.000 --> 09:08.000] So we need to have the data copied from the virtual machine into our dom0,
[09:08.000 --> 09:11.000] then we can transfer it to the disk.
[09:11.000 --> 09:13.000] So it's pretty simple: we allocate memory,
[09:13.000 --> 09:19.000] we copy the data into that memory using the grant table interface,
[09:19.000 --> 09:21.000] then we just write it to the disk.
[09:21.000 --> 09:26.000] SPDK will call a callback that we've given it when it has finished,
[09:26.000 --> 09:31.000] and then we can do the same, in reverse, for the read requests.
[09:31.000 --> 09:37.000] So it's working well for now.
[09:37.000 --> 09:42.000] As you can see, for the read requests, the first and the second columns are not very good;
[09:42.000 --> 09:45.000] that's because the implementation is not finished.
[09:45.000 --> 09:53.000] In this case, I'm doing more grant calls to the hypervisor than on the write requests,
[09:53.000 --> 09:57.000] and that's a big cost in our implementation.
[09:57.000 --> 10:04.000] But for now that's how it's done, and we'll look into improving it.
[10:04.000 --> 10:06.000] But we are doing better than tapdisk:
[10:06.000 --> 10:07.000] that's on the right.
[10:07.000 --> 10:11.000] The blue columns are the current state with tapdisk,
[10:11.000 --> 10:16.000] and our implementation is the red one, on the left.
[10:16.000 --> 10:21.000] It's the same for block size and throughput.
[10:21.000 --> 10:33.000] So we are able to improve the performance of our storage stack in a transparent manner for the VMs,
[10:33.000 --> 10:41.000] because you can take a normal VM that boots today on tapdisk in the current infrastructure of the storage stack
[10:41.000 --> 10:43.000] and still make it work.
[10:43.000 --> 10:49.000] The problem is that we have to dedicate an NVMe drive to the SPDK platform.
[10:49.000 --> 10:56.000] But NVMe is pretty much everywhere nowadays, even in data centers, especially in data centers.
[10:56.000 --> 11:04.000] We are still able to use the security of the grant table, because we keep the model where the VM only shares
[11:04.000 --> 11:11.000] the data that needs to be written to the disk with the backend in SPDK,
[11:11.000 --> 11:18.000] and Xen is still the mediator for this.
[11:18.000 --> 11:26.000] What we want to do next is, of course, to have the read requests do better than tapdisk,
[11:26.000 --> 11:33.000] and since we are in some cases, for example here, already at the same level without this optimization,
[11:33.000 --> 11:35.000] I'm not very worried about that.
[11:35.000 --> 11:44.000] But we also want to be able to not have to copy data from the VM into dom0,
[11:44.000 --> 11:52.000] and instead have it handled by the NVMe directly, using the guest memory as source and destination
[11:52.000 --> 11:56.000] for the DMA requests from the NVMe drive.
[11:56.000 --> 12:06.000] And we want to take a look at the grant table interface to see if it can be improved for modern-day computing.
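As a rough illustration of the write path just described (allocate a DMA-safe buffer with the SPDK allocator, grant-copy the guest data into it, submit it to the NVMe drive, and get a completion callback), here is a minimal sketch against the SPDK NVMe driver API. It assumes an already initialized namespace and I/O queue pair, and reuses the blkif_request layout from the earlier snippet; copy_in_from_guest() and push_blkif_response() are hypothetical helpers standing in for the grant-table copy and the blkif response handling, and the sizing is simplified to whole 4 KiB segments.

    #include <errno.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include "spdk/env.h"
    #include "spdk/nvme.h"

    /* Hypothetical helpers wrapping the Xen grant copy and the blkif response path. */
    static void copy_in_from_guest(void *dst, const struct blkif_request *req);
    static void push_blkif_response(uint64_t id, int16_t status);

    /* Hypothetical per-request context carried through the completion callback. */
    struct pending_write {
        void    *dma_buf;
        uint64_t req_id;
    };

    /* Completion callback: SPDK calls this when the NVMe write has finished. */
    static void write_done(void *cb_arg, const struct spdk_nvme_cpl *cpl)
    {
        struct pending_write *pw = cb_arg;
        push_blkif_response(pw->req_id, spdk_nvme_cpl_is_error(cpl) ? -1 : 0);
        spdk_dma_free(pw->dma_buf);
        free(pw);
    }

    static int submit_guest_write(struct spdk_nvme_ns *ns,
                                  struct spdk_nvme_qpair *qpair,
                                  const struct blkif_request *req)
    {
        /* Simplified sizing: assume each segment covers a full 4 KiB page. */
        size_t len = (size_t)req->nr_segments * 4096;

        /* SPDK needs DMA-safe (hugepage-backed) memory, so use its allocator
         * instead of malloc(). */
        void *dma_buf = spdk_dma_zmalloc(len, 4096, NULL);
        if (dma_buf == NULL)
            return -ENOMEM;

        /* Copy the data out of the guest's granted pages into our buffer. */
        copy_in_from_guest(dma_buf, req);

        struct pending_write *pw = malloc(sizeof(*pw));
        pw->dma_buf = dma_buf;
        pw->req_id  = req->id;

        /* blkif sectors are 512 bytes; convert to the namespace's LBA size. */
        uint32_t lba_size = spdk_nvme_ns_get_sector_size(ns);
        uint64_t lba      = (req->sector_number * 512ULL) / lba_size;

        /* Hand the buffer to the NVMe drive; write_done() fires once the
         * completion is reaped by spdk_nvme_qpair_process_completions(). */
        return spdk_nvme_ns_cmd_write(ns, qpair, dma_buf, lba,
                                      (uint32_t)(len / lba_size),
                                      write_done, pw, 0);
    }

Read requests go the other way around: submit spdk_nvme_ns_cmd_read() into a DMA-safe buffer first, then grant-copy the result back into the guest's pages from the completion callback.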
[12:06.000 --> 12:09.000] So, I'm finished.
[12:09.000 --> 12:18.000] Thank you so much.
[12:18.000 --> 12:19.000] Questions?
[12:19.000 --> 12:21.000] Yes.
[12:21.000 --> 12:27.000] I know from operating an OpenStack cloud that DPDK is very hard to install and implement.
[12:27.000 --> 12:30.000] How hard is it to implement SPDK?
[12:30.000 --> 12:39.000] And the other question, very quickly: can your work somehow be applied to KVM virtualization?
[12:39.000 --> 12:48.000] So the question is whether SPDK is hard to use, like DPDK, and whether it can be used in the KVM infrastructure.
[12:48.000 --> 12:55.000] So it is already being used in the KVM infrastructure, as a storage platform for virtio guests.
[12:55.000 --> 13:00.000] It's just that our case is special because of the different architecture between KVM and Xen.
[13:00.000 --> 13:05.000] So that's already done by the SPDK community.
[13:05.000 --> 13:08.000] I would say that SPDK is not very hard to install.
[13:08.000 --> 13:15.000] In our case, it would be shipped with the hypervisor and the XCP-ng installation,
[13:15.000 --> 13:18.000] so it's not very hard to install.
[13:18.000 --> 13:23.000] We just have to have a special configuration for our dom0,
[13:23.000 --> 13:34.000] because SPDK relies on 2 MB superpages, hugepages, to be able to do DMA requests.
[13:34.000 --> 13:48.000] So we have to have this support, and it's not available in the basic configuration of dom0.
[13:48.000 --> 13:49.000] Yes?
[13:49.000 --> 13:55.000] Does your implementation survive a crash of the SPDK process in dom0?
[13:55.000 --> 13:57.000] Well, yes.
[13:57.000 --> 14:03.000] The question is whether the implementation survives a crash of the SPDK process.
[14:03.000 --> 14:08.000] So the virtual machine would be able to survive SPDK not being available.
[14:08.000 --> 14:13.000] We would lose the disk in the virtual machine;
[14:13.000 --> 14:19.000] well, the disk would hang in the virtual machine,
[14:19.000 --> 14:30.000] but the virtual machine would still be able to run despite the problem.
[14:30.000 --> 14:36.000] Any other questions?
[14:36.000 --> 14:38.000] Thank you.
[14:38.000 --> 14:43.000] Thank you again.