Okay. So, hello, everyone. This presentation is about OpenCSD, which is a computational storage emulation platform. Why we are emulating it, I'll get into shortly. But first, I think I owe you an explanation of computational storage and what it actually is, because I don't think many people are familiar with that, even in this devroom. I'm pretty sure most people are familiar with QEMU and eBPF, though. You can try it yourself; there's a link to the repo. This has been a long-running collaboration that started with my master's thesis at the VU. So, let's get started.

I'm going to briefly explain who I am. I'm Corne Lukken; my handle online is mostly Dantali0n. I'm also a licensed ham radio operator, Papa Delta 3 Sierra Uniform, that is. My expertise is in parallel and distributed systems. I've been in academia for some while: associate degree, bachelor's degree, master's degree. And I've had some work experience throughout that time: I've worked on health technology for visually impaired people, worked on OpenStack with cloud optimizations, and I've done computational storage for my master's thesis, which is what this talk is about. Currently, I'm working on SCADA systems for the LOFAR 2.0 radio telescope at ASTRON.

So, why do we actually need computational storage? Because we live in a data-driven society nowadays. The world is practically exploding with data, so much so that we're expected to store around 200 zettabytes of data by 2050. These high data and throughput requirements pose significant challenges for the storage interfaces and technologies that we are using today. If you look at your traditional computer architecture, the one used on x86, it's based on the von Neumann architecture. Here, we basically need to move all data into main system memory before we can begin processing. This creates memory bottlenecks and interconnect bottlenecks, on networks or PCI Express, and it also hinders energy efficiency to an extent. So, how much of a bandwidth gap are we talking about here?
Well, if you look at a server from 2021, say one using EPYC Milan with 64 SSDs, we're losing about four and a half times the amount of bandwidth: that bandwidth could be offered by all the SSDs in tandem, but it can't be utilized because we can't move the data into memory that fast. So, that's quite significant.

Now, what is this computational storage, and how does it actually solve this? Well, we fit a computational storage device, so a flash storage device, with its own CPU and memory. Now the user, the host processor, can submit small programs to this computational storage device, let them execute there, and only the result data from this computation is then returned over the interconnect into system memory, thereby reducing data movement and potentially improving energy efficiency, because these lower-power cores, often using more specialized hardware, are typically more energy efficient than your general-purpose x86 processor.

If we then look at the state of current prototypes, as of September 2022, we see three main impediments. First is the API between the host and device interface: there's no standardization here. People are building hardware prototypes, but not so much looking at the software interfaces. We also have the problem of the file system: these flash devices carry your file systems, and we want to keep those synchronized between the host and the device. So, how do we achieve that? We can't use cache-coherent interconnects or shared virtual memory, because by the time we round-trip across the PCI Express interface, we'll have lost all the performance that we set out to gain. And how do we stick to existing interfaces? People that access file systems read, write, and use system calls; they are very used to this. If you suddenly needed to link a shared library to access your file system, people wouldn't be up for that.

So, we need some solutions here, and that's what OpenCSD and FluffleFS introduce. We have a simple and intuitive system. All the dependencies and the software itself can run in user space; you don't need any kernel modules or things like that. It's managed entirely in user space.
We use system calls that are available in all operating systems, or at least most typical operating systems: FreeBSD, Windows, macOS, and Linux. So, I'd say that's pretty good. And we do something that's never been done before in computational storage: we allow a regular user on the host to access a file concurrently while a kernel executing on the computational storage device is also accessing that file. This has never been done before. And we managed to do this using existing open-source libraries: Boost, xenium, FUSE, uBPF, and SPDK. Some of you will be familiar with some of these. This allows any user, like you, to try and experience this yourself in QEMU after this talk, without buying any additional hardware. I'll get into that hardware in a second, because there is some specialized hardware involved; if we want to hold this in our hands physically, we have to do some things.

If we look at the design, we see four key components, and a fifth one that I'll explain on the next slide. We're using a log-structured file system, which supports no in-place updates, so everything is append-only. We have a modular interface with backends and frontends, which allows us to experiment and try out new things: we can basically swap the backends and keep the frontend the same. And we're using a new technology in flash SSDs called zoned namespaces. They are commercially available now, but they're still pretty hard to get; that's going to improve in the future.

The system calls that we managed to reuse are extended attributes. With extended attributes, on any file and directory on most file systems, including the file system you are likely using right now, you can set arbitrary key-value pairs on files. We can use this as a hint from the user to the file system, to instruct the file system that something special needs to happen. Basically, we just reserve some keys and assign special behavior to them.
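To make that extended-attribute mechanism concrete, here is a tiny sketch in plain Python on an ordinary Linux file system. The key name is made up for illustration; it is not one of FluffleFS's actual reserved keys.

```python
import os

# Any file on a filesystem with user xattr support (ext4, xfs, btrfs, ...).
path = "xattr_demo.txt"
open(path, "w").close()

# Set an arbitrary key/value pair on the file.
os.setxattr(path, "user.example.hint", b"compress-me")

# Read it back; a file system like FluffleFS can reserve specific keys
# and attach special behavior to them when they are set.
print(os.getxattr(path, "user.example.hint"))   # b'compress-me'
print(os.listxattr(path))                       # ['user.example.hint']

os.remove(path)
```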
Now, let's get back to the topic of zoned namespaces, because I owe you some explanation there. Back when we had hard drives, we could perform arbitrary reads and writes to arbitrary sectors. Sectors could be rewritten all the time without requiring any erasure beforehand. This is what is known as the traditional block interface. But there's a problem, and that is that NAND flash doesn't actually support this behavior. With NAND flash, your sectors are grouped into blocks, and a block needs to be written linearly. Before you can rewrite the information in a block, the block needs to be erased as a whole. So, to accommodate this, flash SSDs have to incorporate what is known as a flash translation layer, where all these requests that go to the same sectors are translated and physically placed somewhere else, just so that the user can keep using the same block interface they have been used to since the time of hard drives. So, there's this translation between logical and physical blocks, and when we try to synchronize the file system between the host and the device while a kernel is running, this introduces a whole lot of problems.

So, how do we solve this? By now, you know the answer: it's zoned namespaces. We basically present an interface that is not the block interface, but an interface that fits NAND flash behavior. When you use a zoned namespaces SSD, you, as the developer of a file system or the kernel, need to write each sector in a zone linearly, and you need to erase the zone as a whole. So, effectively, you become the manager of this SSD: the flash translation layer and the garbage collection live on the host, and we call this whole model host-managed. If we now combine this with a log-structured file system, which also doesn't do any in-place updates, then you naturally see that this becomes a very good fit. Together, these two technologies finally let us synchronize the file system between the host and the device, and we do that by making the file temporarily immutable while the kernel is running. We do that using a snapshot consistency model, by creating in-memory snapshots.
So, we are able to create a representation of the file, as it was on the host, with its metadata, put that into the computational storage device's memory, and assure that all the data referenced there remains immutable during the execution of the kernel. Meanwhile, the user can actually still write to the file, and the metadata of the file on the host will start to differ, but that's not a problem.

This is very powerful, and it also allows us to control kernel behavior in a way, because we can now attach metadata and send it to the computational storage device that says: if the kernel tries to do this, remember, it's a user-submitted program, it might be malicious, then we want to block those actions. So we have a security interface as well.

The final piece of this design is that we want to be architecture-independent, and we do that through eBPF, the system that you're also using for network hooks and event hooks in the Linux kernel nowadays. With eBPF, you can define system calls, helper functions basically, and expose those in a header; this is actually the format of how you would do that, that's a real example. The vendor would implement that code, and you would define some behavior in a specification, but the vendor doesn't have to open-source their code, which, in the case of flash SSDs and their vendors, is pretty important, because they don't seem to be that keen on that. This way, we can still have an interface where users can write programs once and reuse them across all vendors without any problem. And the nice thing about eBPF is that this instruction set architecture, which is what eBPF essentially is, is easily implementable in a VM. There are even pre-existing open-source implementations of this, and that's what we're using: uBPF.

Now that I've explained all the key components of OpenCSD and FluffleFS, I want to start with a little demo and show you some of the actual practical use cases for this. How can we use such a computational storage system in a way that makes sense in terms of data reduction and energy efficiency? For that, we're going to go to the example of Shannon entropy.
Shannon entropy is heavily used by file systems that can perform background compression, or by compression programs that compress in the background. What you basically do is quantify the randomness in a file. Typically, the result is between 0 and 1, but for computers that doesn't really make sense by itself, so we use this log base b here to normalize it for bytes. Then we can look at the distribution of bytes: because a byte has 256 different possible values, we create 256 bins, and we submit a program to calculate this. It runs in the background, and only the result is returned to the host operating system. The host operating system is then free to decide whether or not this file should be compressed.

So, what does such a kernel look like, the kernel that you actually submit to the computational storage device? You can just write them in C and compile them with Clang. So you write them in C, and we have two individual interfaces here that we are exposing. The yellow calls are introduced by the system calls, the eBPF ABI that we are defining, and the purple ones are introduced by the file system. What that means is that, using this system as it is now, it's not agnostic to the file system. It is agnostic to the vendor and the vendor's architecture, so whether this is ARM or x86 doesn't matter, but right now it's specific to the FluffleFS file system that we have written. I will address some possible solutions for this at the end.

Another thing we need to realize is that the eBPF stack size is typically very small; we're talking bytes here instead of kilobytes. So we need a way to address this. What you can do in uBPF is allocate a heap, just as you would a stack, and then we have this bpf_get_mem_info call that we have defined as part of the ABI, which allows you to get your heap pointer. Currently, you have to offset into this heap manually, which is a bit tedious, if you will; you can see that being done here. To store the bins, we offset the buffer by the sector size, so the data from the sector reads is stored at the top of the buffer, and the bins are stored at an offset of precisely one sector size.
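To make the bin counting itself concrete, here is roughly what the offloaded kernel computes per request, written as plain Python rather than eBPF C; the input file name is just a placeholder.

```python
# One bin per possible byte value, filled from a single 512 KiB read:
# conceptually what the entropy kernel does for each request.
def byte_bins(buf: bytes) -> list[int]:
    bins = [0] * 256          # a byte has 256 possible values
    for value in buf:
        bins[value] += 1
    return bins

with open("some_file.bin", "rb") as f:   # placeholder input file
    bins = byte_bins(f.read(512 * 1024))

# These 256 counters are what the device returns to the host,
# instead of the raw 512 KiB of file data.
print(sum(bins), "bytes counted into", len(bins), "bins")
```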
Now, when we get to the file system interface, with all the helpers, data structures, and additional function calls that we introduce, we could later also make a basic implementation of malloc and free here and resolve this properly. But for now, for this example, it's a bit tedious.

So how do you actually trigger this? We had the extended attributes, we had all these systems in place, but now you just have this kernel: you have compiled it, you have stored it in a file, and now you want to actually offload your computation, well, in an emulated fashion, but you want to learn how to do that. The first thing you do is call stat on the kernel object, so on your compiled bytecode, and you get its inode number. You have to remember this inode number, and you then open the file that you want to read from or write to; for the examples we're mostly using reads. Then you call set extended attribute with our reserved key, and set it to the inode number of the kernel file. When you then actually issue read commands, those read commands will go to the computational storage device and run there.

But when do you actually take these snapshots? The trick is: as soon as you set the extended attribute. This is just by design; it could also have been once you call the first read or execute the first write, but we have decided to do it at the moment you set the extended attribute. That means that if you make any changes to your kernel after you've set the extended attribute, nothing changes anymore. And the same goes for the file.

Now I want to briefly explain some different types of kernels that you can have. What the example here is mainly showing is what we call a stream kernel. A stream kernel happens in place of the regular read or write request: the regular read or write request doesn't happen; only the computational storage request happens, on the computational storage device. With an event kernel, it's the other way around: first, the regular event happens normally, and then the kernel is presented with the metadata from that request and can do additional things. This is interesting for databases, for example. Say you're writing a big table, and you want to know the average, or the minimum, or the maximum, and you want to emit that as metadata at the end of your table write.
Well, you could use an event kernel to let the write happen as is; then the kernel gets presented with the data, it runs on the computational storage device, and you emit the metadata afterwards, and you can store that as something like an index.

We have also decided to isolate the context of this computational storage offloading, so what is considered offloaded once you set the attribute, by PID. But we could also do this by file handle, or you could even set it for the whole inode. More so, we could use specific keys for file handle, PID, or inode offloading; it's just a matter of semantics here.

Now, I have some source code in Python for the execution steps I've just shown, because there are a few details that I left out of the brief overview. The first is that you have to stride your requests, and they have to be strided by 512K. Why is this so? Well, in FUSE, the number of kernel pages allocated to move data between the kernel and user space is statically fixed. If you go over this, your request will still seem fine from the user's perspective, but the kernel will chop up your request. Why is that problematic? Well, then multiple kernels spawn, because from the context of the file system, every time it sees a read or write request, it will start a kernel and move it to the computational storage device.

Then here you can see how I set the extended attribute and get the kernel's inode number. What I want to show at the bottom is that I'm getting back 256 integers, one for each of the buckets of the entropy read, while I'm issuing a request of 512K. That shows you the amount of data reduction you can achieve using systems like this: 256 integers for 512K. Pretty good. It could be better, though. The reason it's not better is that floating-point support in eBPF is limited, to the point where you need to implement fixed-point math yourself. We could do this as part of the file system helpers, but that's not done for this prototype at the moment.
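To tie those execution steps together, here is a minimal host-side sketch of the sequence in Python. The reserved key name, the file names, and the assumption that results come back as raw bin counters are placeholders of mine, not FluffleFS's actual values.

```python
import os

STRIDE = 512 * 1024                      # stay within the fixed FUSE request size
RESERVED_KEY = "user.csd.read_stream"    # hypothetical name for the reserved key

# 1. stat the compiled eBPF kernel object to learn its inode number.
kernel_inode = os.stat("entropy_kernel.o").st_ino

# 2. Open the file the kernel should run against.
fd = os.open("dataset.bin", os.O_RDONLY)

# 3. Set the reserved extended attribute to the kernel's inode number;
#    from this point on the file is snapshotted for this process.
os.setxattr("dataset.bin", RESERVED_KEY, str(kernel_inode).encode())

# 4. Issue reads strided at 512 KiB; each read is handed to the kernel on the
#    (emulated) device and returns only the kernel's result data.
size = os.fstat(fd).st_size
for offset in range(0, size, STRIDE):
    result = os.pread(fd, STRIDE, offset)
    # 'result' would hold the kernel output (e.g. the 256 bin counters),
    # not the raw 512 KiB of file data.

os.close(fd)
```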
Now, some limitations. This was master's thesis work, and it was my first time designing a file system ever. It's solely a proof of concept: there's no garbage collection, no deletion, no space reclaiming. Please don't use it. Please don't use it to store your files.

eBPF has an endianness, just like any ISA would have, and there are currently no conversions. So if you happen to use something with a different endianness, all your data will come out byte-swapped, and you have to deal with that yourself for now. But once again, we can make it part of the file system helpers to help with these data structure layout conversions and the endianness conversions. As I mentioned briefly earlier, floating-point support in eBPF is practically non-existent, but we can implement fixed-point math.

I haven't shown any performance numbers, because I don't think they are that interesting: what currently happens when you emulate offloading is that the kernel just runs on the host processor, as is, in eBPF. So it isn't representative of the microcontrollers that you would find on SSDs, and the runtime, the time it would take to execute these kernels, would be much too fast. That's something we need to work on, I think, because then we can more easily reason about what the actual performance would be if we offloaded these applications to SSDs. Frankly, these SSDs do have very capable microcontrollers, typically even multi-core processors, because they need to manage your flash translation layer. So they are already fairly capable devices, actually.

Only read stream kernels have been fully implemented for this prototype as well. That's mainly because event kernel performance is problematic: with an event kernel, remember, the I/O request happens normally, so all the data is moved back to the host processor, and only then is the event kernel started. What you really need is a two-stage system where you prevent the data from being moved back to the host in the first place. This requires some more tinkering.

And the final thing: we need to make this agnostic to the file system. We can achieve this fairly easily using a file system runtime, where, through an ICD, an installable client driver, much the same way that Vulkan, OpenCL, and OpenGL work, you can dynamically load a shared library that implements all the functions you have defined in the header. This runtime can also dynamically compile your programs and then store cached versions of those programs.
Using statfs, we can easily identify what file system we are running on. That allows users to write their programs once, run them on any architecture and on any computational file system, which I think is pretty powerful and flexible.

So that's it. I encourage you to try this. I've also written a thesis on this that does have some performance metrics. It also shows some interesting data structures that we had to design for the file system to be able to support these in-memory snapshots. There's a previous work called ZCSD that also has some early performance information. And I've written quite an extensive survey on the last decade or so of computational flash storage devices, which is also quite interesting. So, thank you.

[Applause]

Seven minutes for questions. Oh, that's quite good. I imagine this is quite difficult, right? Computational storage, what the fuck is that? So please don't hesitate to ask questions if anything is unclear.

What's the availability of hardware that can do this? The computational storage? Yes, the computational storage. There is one vendor selling a computational storage device that is not based on zoned namespaces storage. It uses conventional SSDs, and it supports computational storage through a network interface: you have the normal PCIe interface, and then there's a transport over, I believe, TCP/IP, and you basically just connect to it over SSH and then you can do things on the SSD. That one is commercially available; I don't know what they would ask for that product.

What does ZCSD have to do with zoned namespaces? Nothing in principle, but you need a way to synchronize the file system between the host and the device, and zoned namespaces make that trivial, whereas on conventional SSDs the logical-to-physical block translation severely hinders this; it makes it extremely difficult to do.

So why didn't you include the performance numbers from your thesis, or better ones? Because the performance... oh, sorry, yeah, I should repeat the question first. Very good; I forget that all the time.
Yeah, so why didn't I include any performance metrics if I have them? The answer is that I don't think I would have had the time, and I don't think they're interesting enough to include. This is a very complicated subject; it's very new for most people, computational storage, most people have never heard of it. So I'd much rather spend the time explaining it properly and try to show you that this is a very interesting concept to solve this bandwidth gap, rather than show you some metrics that are not representative anyway, because the kernel is running on the host CPU, and you're not going to have an additional host CPU on the flash SSD.

Can you talk about what kind of test setup you had for your metrics? So I don't... Yeah, for the metrics themselves. Yeah, so the framework... Okay, yeah, very good. What kind of test setup did I have to do all these analyses and try these things out? I run QEMU on my own host machine, just a normal laptop, basically this one. QEMU then creates a virtual zoned namespaces device; that was actually introduced to QEMU quite recently, so you can now try zoned namespaces without owning a zoned namespaces SSD. That's the whole reason QEMU comes into play, because otherwise people would need to buy a zoned namespaces SSD, which is still quite hard to get. And then you just run the prototype as is. That's all you need; you really don't need any special hardware. Yeah, it could even be on an ARM laptop, it doesn't matter.

Did you test it? No, I did not test it. Whether or not I tested if it works on ARM: the answer is no, I did not test it, but I'm pretty sure QEMU compiles on ARM, so I'm pretty sure we're good there. Because you have to remember, and that's maybe not intrinsically clear from this presentation, that we didn't extend QEMU in any way. It's just a normal QEMU installation; you don't even need a custom build, you can just get it from the package manager and use it.

I have a lot of questions about that.
Regarding the computational part, what are the limitations on what kind of kernels may run on these devices? So, what are the limitations on the kernels that you run on these devices? Well, first of all, you need to have data reduction, right? If you're going to read one gigabyte from the flash storage and you're going to return one gigabyte of data to the host, then there's no real point in offloading, because the data is going to be moved anyway. So the first limitation is that you have to find an application that is reductive in nature: once you do the computation, you return less data. The nice thing is that's 99% of all workloads, right? So that's pretty good. The second thing is that if it's timing-critical and the computation takes a long time, then it's probably not that interesting, because the latency will be too bad: the performance of these cores is much less than your host processor. But you can implement specialized instructions that could be very efficient at doing database filtering or things like that, and that is where the whole ASIC and FPGA part would come into play. But if it's not timing-critical and it runs in the background, like the Shannon entropy check for compression, those are the ideal cases: reduction in data, and not timing-critical.

So what you mean is we can have software kernels with the backend in hardware, so we can also program the hardware? Like maybe a CPU core or a GPU on the board? To repeat the question: whether it's just software, or whether we also program the hardware. Of course, FPGAs can be reprogrammed on the fly, and we have seen prototypes in the past for computational storage devices that do just that: from the host, the user sends a bitstream that dynamically reprograms the FPGA, and then the kernel starts running. That's not what we're trying to achieve here. What I envision is that the FPGA has specialized logic to do certain computations, and then, from the ABI, from the eBPF ABI, when code triggers those instructions, it will utilize the FPGA to do those computations, but they would be defined in the specification beforehand.
Because typically, reflashing an FPGA with a new bitstream takes quite some time, so in the interest of performance it might not be that interesting.

So I'm going to ask a question. You might have mentioned it, but are there closed-source competitors? Whether there are closed-source competitors in the space of computational storage: well, actually, that's one of the things that's been going really well in this scene. I'd say the vast majority of everything is open source, at least if you look at the recent things. If you look at the past decade, then it's a bit worse, because there is a lot of research published that doesn't actually publish the source code, or the source code is published but everything is a hardware prototype and they didn't publish the bitstreams or the VHDL or the Verilog, so you're stuck then as well; or they didn't publish any PCB designs, so you can't reproduce the work, if you will. I'd say this is a much bigger problem in academia than just computational storage, but it's also present here.

Yes. Which one? The Python code. Complexity in terms of? So, the question is why this is a nested loop, why I have a nested loop here, how it does in terms of performance, and why Python. The trick is that this is just for demonstration purposes; you could easily make this example in C or C++ if you care about performance.
The trick is that this program is already spending 99% of its time in I/O wait, because it's waiting for the kernel to complete, so in that light it's not that interesting. And the reason we have a nested loop is that floating-point support in eBPF is non-existent, or at least I didn't implement fixed-point math. So what I have to do after this, at the bottom, what you don't see here, is that from all these buckets, these bins, I'm actually computing the distribution using floating-point math in Python. That's why I don't get a single number from this kernel: if I had a floating-point implementation in eBPF, I could already do that computation in eBPF and only return a single 32-bit float as the result, instead of these 256 integers. But the reason this is still a loop is that I still have to stride the read requests, because I can't go above 512K, even if my file is bigger than 512K.

You said it's spending a lot of time in I/O wait. Couldn't you just thread it, just to prove it, or does that not make any sense in this case? Well, the trick is, okay, couldn't I implement multithreading here? Currently, the eBPF VM runs as a single process, so even if you submit multiple kernels, only one will execute at a time. Why? It's a thesis prototype, right? Time, things like this. Okay. Thank you very much. No worries.
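For reference, the host-side floating-point step described in that last answer looks roughly like this in plain Python: turning the 256 bin counters returned by the kernel into one entropy value, normalized to the range 0 to 1 with log base 256. This is an illustration following the talk's description, not the project's actual demo code.

```python
import math

def shannon_entropy(bins: list[int]) -> float:
    """Combine the 256 bin counters into a single entropy value in [0, 1].

    Uses log base 256 to normalize for bytes; this floating-point step
    currently runs on the host because eBPF lacks native float support.
    """
    total = sum(bins)
    entropy = 0.0
    for count in bins:
        if count:
            p = count / total
            entropy -= p * math.log(p, 256)
    return entropy

print(shannon_entropy([1] * 256))           # uniform bytes  -> 1.0 (maximum randomness)
print(shannon_entropy([256] + [0] * 255))   # one byte value -> 0.0 (no randomness)
```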