Okay. So, hello, everyone. This presentation is about OpenCSD, which is a computational storage emulation platform. Why we are emulating it, I'll get into shortly. But first, I think I owe you an explanation of computational storage and what it actually is, because I don't think many people are familiar with that, even in this devroom. I'm pretty sure most people are familiar with QEMU and eBPF, though. You can try it yourself; there's a link to the repo. This has been a long-running collaboration that started with my master's thesis at the VU. So, let's get started.

I'm going to briefly explain who I am. I'm Corne Lukken; my handle online is mostly Dantali0n. I'm also a licensed ham radio operator, Papa Delta 3 Sierra Uniform, that is. My expertise is in parallel and distributed systems. I've been in academia for some while: associate degree, bachelor's degree, master's degree. And I've had some work experience throughout that time: I've worked on health technology for visually impaired people, worked on OpenStack with cloud optimizations, and I've done computational storage for my master's thesis, which is what this talk is about. Currently, I'm working on SCADA systems for the LOFAR 2.0 radio telescope at ASTRON.

So, why do we actually need computational storage? Because we live in a data-driven society nowadays. The world is practically exploding with data, so much so that we're expected to store around 200 zettabytes of data by 2050. These high data and throughput requirements pose significant challenges for the storage interfaces and technologies that we are using today. If you look at your traditional computer architecture, the one used on x86, it's based on the von Neumann architecture. Here, we basically need to move all data into main system memory before we can begin processing. This creates memory bottlenecks and interconnect bottlenecks, on networks or PCI Express, and it also hinders energy efficiency to an extent. So, how much of a bandwidth gap are we talking about here?
Well, if you look at a server from 2021, say one using EPYC Milan with 64 SSDs, we're losing about four and a half times the amount of bandwidth: that bandwidth could be offered by all the SSDs in tandem, but it can't be utilized because we can't move the data into memory that fast. So, that's quite significant.

Now, what is this computational storage, and how does it actually solve this? Well, we fit a computational storage device, so a flash storage device, with its own CPU and memory. Now the user, the host processor, can submit small programs to this computational storage device, let them execute there, and only the result data from this computation is then returned over the interconnect into system memory, thereby reducing data movement and potentially improving energy efficiency, because these lower-power cores, often using more specialized hardware, are typically more energy efficient than your general-purpose x86 processor.

If we then look at the state of current prototypes, as of September 2022, we see three main impediments. First is the API between the host and device interface: there's no standardization here. People are building hardware prototypes, but not so much looking at the software interfaces. We also have the problem of the file system: these flash devices carry your file systems, and we want to keep those synchronized between the host and the device. So, how do we achieve that? We can't use cache-coherent interconnects or shared virtual memory, because by the time we round-trip across the PCI Express interface, we'll have lost all the performance that we set out to gain. And how do we stick to existing interfaces? People that access file systems read, write, and use system calls; they are very used to this. If you suddenly needed to link a shared library to access your file system, people wouldn't be up for that.

So, we need some solutions here, and that's what OpenCSD and FluffleFS introduce. We have a simple and intuitive system. All the dependencies and the software itself can run in user space; you don't need any kernel modules or things like that. It's managed entirely in user space.
We use system calls that are available in all operating systems, or at least most typical operating systems: FreeBSD, Windows, macOS, and Linux. So, I'd say that's pretty good. And we do something that's never been done before in computational storage: we allow a regular user on the host to access a file concurrently while a kernel executing on the computational storage device is also accessing that file. This has never been done before. And we managed to do this using existing open-source libraries: Boost, xenium, FUSE, uBPF, and SPDK. Some of you will be familiar with some of these. This allows any user, like you, to try and experience this yourself in QEMU after this talk, without buying any additional hardware. I'll get into that hardware in a second, because there is some specialized hardware involved; if we want to hold this in our hands physically, we have to do some things.

If we look at the design, we see four key components, and a fifth one that I'll explain on the next slide. We're using a log-structured file system, which supports no in-place updates, so everything is append-only. We have a modular interface with backends and frontends, which allows us to experiment and try out new things: we can basically swap the backends and keep the frontend the same. And we're using a new technology in flash SSDs called zoned namespaces. They are commercially available now, but they're still pretty hard to get; that's going to improve in the future.

The system calls that we managed to reuse are extended attributes. With extended attributes, on any file and directory on most file systems, including the file system you are likely using right now, you can set arbitrary key-value pairs on files. We can use this as a hint from the user to the file system, to instruct the file system that something special needs to happen. Basically, we just reserve some keys and assign special behavior to them.
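To make that extended-attribute mechanism concrete, here is a tiny sketch in plain Python on an ordinary Linux file system. The key name is made up for illustration; it is not one of FluffleFS's actual reserved keys.

```python
import os

# Any file on a filesystem with user xattr support (ext4, xfs, btrfs, ...).
path = "xattr_demo.txt"
open(path, "w").close()

# Set an arbitrary key/value pair on the file.
os.setxattr(path, "user.example.hint", b"compress-me")

# Read it back; a file system like FluffleFS can reserve specific keys
# and attach special behavior to them when they are set.
print(os.getxattr(path, "user.example.hint"))   # b'compress-me'
print(os.listxattr(path))                       # ['user.example.hint']

os.remove(path)
```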
Now, let's get back to the topic of zoned namespaces, because I owe you some explanation there. Back when we had hard drives, we could perform arbitrary reads and writes to arbitrary sectors. Sectors could be rewritten all the time without requiring any erasure beforehand. This is what is known as the traditional block interface. But there's a problem, and that is that NAND flash doesn't actually support this behavior. With NAND flash, your sectors are grouped into blocks, and a block needs to be written linearly. Before you can rewrite the information in a block, the block needs to be erased as a whole. So, to accommodate this, flash SSDs have to incorporate what is known as a flash translation layer, where all these requests that go to the same sectors are translated and physically placed somewhere else, just so that the user can keep using the same block interface they have been used to since the time of hard drives. So, there's this translation between logical and physical blocks, and when we try to synchronize the file system between the host and the device while a kernel is running, this introduces a whole lot of problems.

So, how do we solve this? By now, you know the answer: it's zoned namespaces. We basically present an interface that is not the block interface, but an interface that fits NAND flash behavior. When you use a zoned namespaces SSD, you, as the developer of a file system or the kernel, need to write each sector in a zone linearly, and you need to erase the zone as a whole. So, effectively, you become the manager of this SSD: the flash translation layer and the garbage collection live on the host, and we call this whole model host-managed. If we now combine this with a log-structured file system, which also doesn't do any in-place updates, then you naturally see that this becomes a very good fit. Together, these two technologies finally let us synchronize the file system between the host and the device, and we do that by making the file temporarily immutable while the kernel is running. We do that using a snapshot consistency model, by creating in-memory snapshots.
So, we are able to create a representation of the file, as it was on the host, with its metadata, put that into the computational storage device's memory, and assure that all the data referenced there remains immutable during the execution of the kernel. Meanwhile, the user can actually still write to the file, and the metadata of the file on the host will start to differ, but that's not a problem.

This is very powerful, and it also allows us to control kernel behavior in a way, because we can now attach metadata and send it to the computational storage device that says: if the kernel tries to do this, remember, it's a user-submitted program, it might be malicious, then we want to block those actions. So we have a security interface as well.

The final piece of this design is that we want to be architecture-independent, and we do that through eBPF, the system that you're also using for network hooks and event hooks in the Linux kernel nowadays. With eBPF, you can define system calls, helper functions basically, and expose those in a header; this is actually the format of how you would do that, that's a real example. The vendor would implement that code, and you would define some behavior in a specification, but the vendor doesn't have to open-source their code, which, in the case of flash SSDs and their vendors, is pretty important, because they don't seem to be that keen on that. This way, we can still have an interface where users can write programs once and reuse them across all vendors without any problem. And the nice thing about eBPF is that this instruction set architecture, which is what eBPF essentially is, is easily implementable in a VM. There are even pre-existing open-source implementations of this, and that's what we're using: uBPF.

Now that I've explained all the key components of OpenCSD and FluffleFS, I want to start with a little demo and show you some of the actual practical use cases for this. How can we use such a computational storage system in a way that makes sense in terms of data reduction and energy efficiency? For that, we're going to go to the example of Shannon entropy.
Shannon entropy is heavily used by file systems that can perform background compression, or by compression programs that compress in the background. What you basically do is quantify the randomness in a file. Typically, the result is between 0 and 1, but for computers that doesn't really make sense by itself, so we use this log base b here to normalize it for bytes. Then we can look at the distribution of bytes: because a byte has 256 different possible values, we create 256 bins, and we submit a program to calculate this. It runs in the background, and only the result is returned to the host operating system. The host operating system is then free to decide whether or not this file should be compressed.

So, what does such a kernel look like, the kernel that you actually submit to the computational storage device? You can just write them in C and compile them with Clang. So you write them in C, and we have two individual interfaces here that we are exposing. The yellow calls are introduced by the system calls, the eBPF ABI that we are defining, and the purple ones are introduced by the file system. What that means is that, using this system as it is now, it's not agnostic to the file system. It is agnostic to the vendor and the vendor's architecture, so whether this is ARM or x86 doesn't matter, but right now it's specific to the FluffleFS file system that we have written. I will address some possible solutions for this at the end.

Another thing we need to realize is that the eBPF stack size is typically very small; we're talking bytes here instead of kilobytes. So we need a way to address this. What you can do in uBPF is allocate a heap, just as you would a stack, and then we have this bpf_get_mem_info call that we have defined as part of the ABI, which allows you to get your heap pointer. Currently, you have to offset into this heap manually, which is a bit tedious, if you will; you can see that being done here. To store the bins, we offset the buffer by the sector size, so the data from the sector reads is stored at the top of the buffer, and the bins are stored at an offset of precisely one sector size.
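To make the bin counting itself concrete, here is roughly what the offloaded kernel computes per request, written as plain Python rather than eBPF C; the input file name is just a placeholder.

```python
# One bin per possible byte value, filled from a single 512 KiB read:
# conceptually what the entropy kernel does for each request.
def byte_bins(buf: bytes) -> list[int]:
    bins = [0] * 256          # a byte has 256 possible values
    for value in buf:
        bins[value] += 1
    return bins

with open("some_file.bin", "rb") as f:   # placeholder input file
    bins = byte_bins(f.read(512 * 1024))

# These 256 counters are what the device returns to the host,
# instead of the raw 512 KiB of file data.
print(sum(bins), "bytes counted into", len(bins), "bins")
```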
Now, when we get to the file system interface, with all the helpers, data structures, and additional function calls that we introduce, we could later also make a basic implementation of malloc and free here and resolve this properly. But for now, for this example, it's a bit tedious.

So how do you actually trigger this? We had the extended attributes, we had all these systems in place, but now you just have this kernel: you have compiled it, you have stored it in a file, and now you want to actually offload your computation, well, in an emulated fashion, but you want to learn how to do that. The first thing you do is call stat on the kernel object, so on your compiled bytecode, and you get its inode number. You have to remember this inode number, and you then open the file that you want to read from or write to; for the examples we're mostly using reads. Then you call set extended attribute with our reserved key, and set it to the inode number of the kernel file. When you then actually issue read commands, those read commands will go to the computational storage device and run there.

But when do you actually take these snapshots? The trick is: as soon as you set the extended attribute. This is just by design; it could also have been once you call the first read or execute the first write, but we have decided to do it at the moment you set the extended attribute. That means that if you make any changes to your kernel after you've set the extended attribute, nothing changes anymore. And the same goes for the file.

Now I want to briefly explain some different types of kernels that you can have. What the example here is mainly showing is what we call a stream kernel. A stream kernel happens in place of the regular read or write request: the regular read or write request doesn't happen; only the computational storage request happens, on the computational storage device. With an event kernel, it's the other way around: first, the regular event happens normally, and then the kernel is presented with the metadata from that request and can do additional things. This is interesting for databases, for example. Say you're writing a big table, and you want to know the average, or the minimum, or the maximum, and you want to emit that as metadata at the end of your table write.
Well, you could use an event kernel to let the write happen as is; then the kernel gets presented with the data, it runs on the computational storage device, and you emit the metadata afterwards, and you can store that as something like an index.

We have also decided to isolate the context of this computational storage offloading, so what is considered offloaded once you set the attribute, by PID. But we could also do this by file handle, or you could even set it for the whole inode. More so, we could use specific keys for file handle, PID, or inode offloading; it's just a matter of semantics here.

Now, I have some source code in Python for the execution steps I've just shown, because there are a few details that I left out of the brief overview. The first is that you have to stride your requests, and they have to be strided by 512K. Why is this so? Well, in FUSE, the number of kernel pages allocated to move data between the kernel and user space is statically fixed. If you go over this, your request will still seem fine from the user's perspective, but the kernel will chop up your request. Why is that problematic? Well, then multiple kernels spawn, because from the context of the file system, every time it sees a read or write request, it will start a kernel and move it to the computational storage device.

Then here you can see how I set the extended attribute and get the kernel's inode number. What I want to show at the bottom is that I'm getting back 256 integers, one for each of the buckets of the entropy read, while I'm issuing a request of 512K. That shows you the amount of data reduction you can achieve using systems like this: 256 integers for 512K. Pretty good. It could be better, though. The reason it's not better is that floating-point support in eBPF is limited, to the point where you need to implement fixed-point math yourself. We could do this as part of the file system helpers, but that's not done for this prototype at the moment.
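To tie those execution steps together, here is a minimal host-side sketch of the sequence in Python. The reserved key name, the file names, and the assumption that results come back as raw bin counters are placeholders of mine, not FluffleFS's actual values.

```python
import os

STRIDE = 512 * 1024                      # stay within the fixed FUSE request size
RESERVED_KEY = "user.csd.read_stream"    # hypothetical name for the reserved key

# 1. stat the compiled eBPF kernel object to learn its inode number.
kernel_inode = os.stat("entropy_kernel.o").st_ino

# 2. Open the file the kernel should run against.
fd = os.open("dataset.bin", os.O_RDONLY)

# 3. Set the reserved extended attribute to the kernel's inode number;
#    from this point on the file is snapshotted for this process.
os.setxattr("dataset.bin", RESERVED_KEY, str(kernel_inode).encode())

# 4. Issue reads strided at 512 KiB; each read is handed to the kernel on the
#    (emulated) device and returns only the kernel's result data.
size = os.fstat(fd).st_size
for offset in range(0, size, STRIDE):
    result = os.pread(fd, STRIDE, offset)
    # 'result' would hold the kernel output (e.g. the 256 bin counters),
    # not the raw 512 KiB of file data.

os.close(fd)
```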
Now, some limitations. This was master's thesis work, and it was my first time designing a file system ever. It's solely a proof of concept: there's no garbage collection, no deletion, no space reclaiming. Please don't use it. Please don't use it to store your files.

eBPF has an endianness, just like any ISA would have, and there are currently no conversions. So if you happen to use something with a different endianness, all your data will come out byte-swapped, and you have to deal with that yourself for now. But once again, we can make it part of the file system helpers to help with these data structure layout conversions and the endianness conversions. As I mentioned briefly earlier, floating-point support in eBPF is practically non-existent, but we can implement fixed-point math.

I haven't shown any performance numbers, because I don't think they are that interesting: what currently happens when you emulate offloading is that the kernel just runs on the host processor, as is, in eBPF. So it isn't representative of the microcontrollers that you would find on SSDs, and the runtime, the time it would take to execute these kernels, would be much too fast. That's something we need to work on, I think, because then we can more easily reason about what the actual performance would be if we offloaded these applications to SSDs. Frankly, these SSDs do have very capable microcontrollers, typically even multi-core processors, because they need to manage your flash translation layer. So they are already fairly capable devices, actually.

Only read stream kernels have been fully implemented for this prototype as well. That's mainly because event kernel performance is problematic: with an event kernel, remember, the I/O request happens normally, so all the data is moved back to the host processor, and only then is the event kernel started. What you really need is a two-stage system where you prevent the data from being moved back to the host in the first place. This requires some more tinkering.

And the final thing: we need to make this agnostic to the file system. We can achieve this fairly easily using a file system runtime, where, through an ICD, an installable client driver, much the same way that Vulkan, OpenCL, and OpenGL work, you can dynamically load a shared library that implements all the functions you have defined in the header. This runtime can also dynamically compile your programs and then store cached versions of those programs.
Using statfs, we can easily identify what file system we are running on. That allows users to write their programs once, run them on any architecture and on any computational file system, which I think is pretty powerful and flexible.

So that's it. I encourage you to try this. I've also written a thesis on this that does have some performance metrics. It also shows some interesting data structures that we had to design for the file system to be able to support these in-memory snapshots. There's a previous work called ZCSD that also has some early performance information. And I've written quite an extensive survey on the last decade or so of computational flash storage devices, which is also quite interesting. So, thank you.

[Applause]

Seven minutes for questions. Oh, that's quite good. I imagine this is quite difficult, right? Computational storage, what the fuck is that? So please don't hesitate to ask questions if anything is unclear.

What's the availability of hardware that can do this? The computational storage? Yes, the computational storage. There is one vendor selling a computational storage device that is not based on zoned namespaces storage. It uses conventional SSDs, and it supports computational storage through a network interface: you have the normal PCIe interface, and then there's a transport over, I believe, TCP/IP, and you basically just connect to it over SSH and then you can do things on the SSD. That one is commercially available; I don't know what they would ask for that product.

What does ZCSD have to do with zoned namespaces? Nothing in principle, but you need a way to synchronize the file system between the host and the device, and zoned namespaces make that trivial, whereas on conventional SSDs the logical-to-physical block translation severely hinders this; it makes it extremely difficult to do.

So why didn't you include the performance numbers from your thesis, or better ones? Because the performance... oh, sorry, yeah, I should repeat the question first. Very good; I forget that all the time.
Yeah, so why didn't I include any performance metrics if I have them? The answer is that I don't think I would have had the time, and I don't think they're interesting enough to include. This is a very complicated subject; it's very new for most people, computational storage, most people have never heard of it. So I'd much rather spend the time explaining it properly and try to show you that this is a very interesting concept to solve this bandwidth gap, rather than show you some metrics that are not representative anyway, because the kernel is running on the host CPU, and you're not going to have an additional host CPU on the flash SSD.

Can you talk about what kind of test setup you had for your metrics? So I don't... Yeah, for the metrics themselves. Yeah, so the framework... Okay, yeah, very good. What kind of test setup did I have to do all these analyses and try these things out? I run QEMU on my own host machine, just a normal laptop, basically this one. QEMU then creates a virtual zoned namespaces device; that was actually introduced to QEMU quite recently, so you can now try zoned namespaces without owning a zoned namespaces SSD. That's the whole reason QEMU comes into play, because otherwise people would need to buy a zoned namespaces SSD, which is still quite hard to get. And then you just run the prototype as is. That's all you need; you really don't need any special hardware. Yeah, it could even be on an ARM laptop, it doesn't matter.

Did you test it? No, I did not test it. Whether or not I tested if it works on ARM: the answer is no, I did not test it, but I'm pretty sure QEMU compiles on ARM, so I'm pretty sure we're good there. Because you have to remember, and that's maybe not intrinsically clear from this presentation, that we didn't extend QEMU in any way. It's just a normal QEMU installation; you don't even need a custom build, you can just get it from the package manager and use it.

I have a lot of questions about that.
Regarding the computational part, what are the limitations on what kind of kernels may run on these devices? So, what are the limitations on the kernels that you run on these devices? Well, first of all, you need to have data reduction, right? If you're going to read one gigabyte from the flash storage and you're going to return one gigabyte of data to the host, then there's no real point in offloading, because the data is going to be moved anyway. So the first limitation is that you have to find an application that is reductive in nature: once you do the computation, you return less data. The nice thing is that's 99% of all workloads, right? So that's pretty good. The second thing is that if it's timing-critical and the computation takes a long time, then it's probably not that interesting, because the latency will be too bad: the performance of these cores is much less than your host processor. But you can implement specialized instructions that could be very efficient at doing database filtering or things like that, and that is where the whole ASIC and FPGA part would come into play. But if it's not timing-critical and it runs in the background, like the Shannon entropy check for compression, those are the ideal cases: reduction in data, and not timing-critical.

So what you mean is we can have software kernels with the backend in hardware, so we can also program the hardware? Like maybe a CPU core or a GPU on the board? To repeat the question: whether it's just software, or whether we also program the hardware. Of course, FPGAs can be reprogrammed on the fly, and we have seen prototypes in the past for computational storage devices that do just that: from the host, the user sends a bitstream that dynamically reprograms the FPGA, and then the kernel starts running. That's not what we're trying to achieve here. What I envision is that the FPGA has specialized logic to do certain computations, and then, from the ABI, from the eBPF ABI, when code triggers those instructions, it will utilize the FPGA to do those computations, but they would be defined in the specification beforehand.
Because typically, reflashing an FPGA with a new bitstream takes quite some time, so in the interest of performance it might not be that interesting.

So I'm going to ask a question. You might have mentioned it, but are there closed-source competitors? Whether there are closed-source competitors in the space of computational storage: well, actually, that's one of the things that's been going really well in this scene. I'd say the vast majority of everything is open source, at least if you look at the recent things. If you look at the past decade, then it's a bit worse, because there is a lot of research published that doesn't actually publish the source code, or the source code is published but everything is a hardware prototype and they didn't publish the bitstreams or the VHDL or the Verilog, so you're stuck then as well; or they didn't publish any PCB designs, so you can't reproduce the work, if you will. I'd say this is a much bigger problem in academia than just computational storage, but it's also present here.

Yes. Which one? The Python code. Complexity in terms of? So, the question is why this is a nested loop, why I have a nested loop here, how it does in terms of performance, and why Python. The trick is that this is just for demonstration purposes; you could easily make this example in C or C++ if you care about performance.
The trick is that this program is already spending 99% of its time in I/O wait, because it's waiting for the kernel to complete, so in that light it's not that interesting. And the reason we have a nested loop is that floating-point support in eBPF is non-existent, or at least I didn't implement fixed-point math. So what I have to do after this, at the bottom, what you don't see here, is that from all these buckets, these bins, I'm actually computing the distribution using floating-point math in Python. That's why I don't get a single number from this kernel: if I had a floating-point implementation in eBPF, I could already do that computation in eBPF and only return a single 32-bit float as the result, instead of these 256 integers. But the reason this is still a loop is that I still have to stride the read requests, because I can't go above 512K, even if my file is bigger than 512K.

You said it's spending a lot of time in I/O wait. Couldn't you just thread it, just to prove it, or does that not make any sense in this case? Well, the trick is, okay, couldn't I implement multithreading here? Currently, the eBPF VM runs as a single process, so even if you submit multiple kernels, only one will execute at a time. Why? It's a thesis prototype, right? Time, things like this. Okay. Thank you very much. No worries.
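For reference, the host-side floating-point step described in that last answer looks roughly like this in plain Python: turning the 256 bin counters returned by the kernel into one entropy value, normalized to the range 0 to 1 with log base 256. This is an illustration following the talk's description, not the project's actual demo code.

```python
import math

def shannon_entropy(bins: list[int]) -> float:
    """Combine the 256 bin counters into a single entropy value in [0, 1].

    Uses log base 256 to normalize for bytes; this floating-point step
    currently runs on the host because eBPF lacks native float support.
    """
    total = sum(bins)
    entropy = 0.0
    for count in bins:
        if count:
            p = count / total
            entropy -= p * math.log(p, 256)
    return entropy

print(shannon_entropy([1] * 256))           # uniform bytes  -> 1.0 (maximum randomness)
print(shannon_entropy([256] + [0] * 255))   # one byte value -> 0.0 (no randomness)
```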