So, once again, hello everybody, welcome to my talk. This talk is about OSv, and specifically the evolution of OSv towards greater modularity and composability. Thanks for introducing me.

I've been contributing to OSv since 2016. In 2015 I heard about OSv at one of the conferences, and a couple of years later I was nominated to be one of its committers. My greatest contributions to OSv include making OSv run on Firecracker and significantly improving the AArch64 port, among other things.

I'm not sure if you can tell, but OSv is actually my hobby, so I'm not a full-time kernel developer like many of the previous speakers. I work on it at night when I feel like it, I have a day job, and I don't represent the company I work for; this is all my personal contribution to the project.

In today's presentation I will talk about the enhancements introduced by the latest release of OSv, 0.57, with a focus on greater modularity and composability, but I will also discuss other interesting enhancements like the lazy stack, new ways to build ZFS images, and improvements to the ARM port. Finally, I will also cover an interesting use case of OSv: SeaweedFS, a distributed file system, running on OSv.

So, as you can see, besides the title topic of modularity I will try to give you the state of the art of where OSv is, how it has changed recently, and a little bit of where it is hopefully going.

I know there are probably many definitions of unikernels, and each of them is a little bit different, and I'm sure most of you understand what unikernels are, but here is a quick recap with emphasis on how OSv is a little bit different. OSv is a unikernel that was designed to run a single, unmodified Linux application on top of a hypervisor, whereas traditional operating systems were originally designed to run on a vast range of physical machines. Simply speaking, OSv is an OS designed to run a single application without isolation between the application and the kernel, or it can be thought of as a way to run a highly isolated process without the ability to make system calls to the host OS. Finally, OSv can run on both 64-bit x86 and ARMv8 architectures.

Now a little bit of history. OSv, for those that don't know, was started in late 2012 by a company called Cloudius Systems, and they built a pretty strong team of, I think, 10 to 20 developers.
I wasn't one of them, but they wrote most of OSv. At some point, I'm guessing, they realized they had to make money, so they moved on and started working on a product you may know, ScyllaDB, which is a high-performance database, but I think they took some learnings from OSv with them. After that, OSv received a grant from the European Union, so there was a project around that, and I think there may have been some companies using OSv as well, but honestly, since then it has really been maintained by volunteers like me. There are still some people from ScyllaDB, like Nadav Har'El and others, who contribute to the project, and I would single out Fotis Xenakis, who was the one that implemented virtio-fs, a very interesting contribution to OSv. Obviously I would like to take this opportunity to invite more people to become part of our community, because you may not realize it, but our community is very small; it is really just me, Nadav, and a couple of other people contributing to the project, so I hope we are going to grow as a community after this talk.

Here is a quick recap of what OSv looks like and what the design is. On this slide you can see the major components of OSv across layers: at the top the glibc layer, which is actually largely based on musl; in the middle the core layer, comprised of the ELF dynamic linker, the VFS (virtual file system), the networking stack, the thread scheduler, the page cache, RCU (read-copy-update), page table management, and the L1/L2 pools that manage memory; and then a layer of device drivers, where OSv implements virtio devices over both the PCI and MMIO transports, plus Xen and VMware devices, among others. As far as hypervisors go, OSv can run on KVM-based hypervisors like QEMU and Firecracker, and I also tested it on Cloud Hypervisor, which is, I think, Intel's hypervisor written in Rust. I personally haven't really run OSv on Xen, so the Xen support is probably a little bit dated and I'm not sure how much it has been tested. I did test on VMware, on VirtualBox, and I think on HyperKit at some point. I won't go into more detail about this diagram, but I will leave it with you as a reference for later.

So, in the first part of this presentation, about modularity and composability, I will focus on the new experimental modes to hide the non-glibc symbols and the standard C++ library.
I will also discuss how the ZFS code was extracted out of the kernel in the form of a dynamically linked library, and finally I will explain another new build option to tailor the kernel to a set of specific drivers (I call them driver profiles), plus another new mechanism that allows building a version of the kernel with only the subset of glibc symbols needed to support a specific application, which I think is quite interesting.

By design, OSv has always been a fat unikernel, which has drawn some criticism. By default it provides a large subset of glibc functionality, includes the full standard C++ library and the ZFS implementation, ships drivers for many devices, and supports many hypervisors. On one hand, that makes running an arbitrary application on any hypervisor very easy using a single universal kernel. On the other hand, such universality comes at the price of a bloated kernel with many symbols, drivers, and possibly ZFS that go unused, which causes inefficient memory usage, longer boot time, and potential security vulnerabilities. In addition, a C++ application linked against a version of libstdc++ different from the one the kernel was linked against may simply not work. For example, that happened to me when I was testing OSv with .NET, and the only way to make it work was to hide the kernel's C++ standard library and use the one that was part of the .NET app.

So one way to lower the memory utilization of the guest is to minimize the kernel size. By default OSv comes with a universal kernel that provides quite a large spectrum of the glibc library and the full standard C++ library, and exposes a total of over 17,000 symbols, most of them very long C++ symbols that inflate the symbol table. So the question may be posed: why not have a mechanism to build a kernel with all non-glibc symbols hidden and all unneeded, unused code garbage-collected? An extra benefit of fewer exported symbols is increased security, which stems from the fact that there is simply less potential code left that could be harmful. That way we also achieve better compatibility, as any potential symbol collisions, for example the mismatched standard C++ library I mentioned, can be avoided.

So release 0.57 added a new build option called conf_hide_symbols to hide those non-glibc symbols and the standard C++ library symbols. If it is enabled, in essence most files in the OSv source tree, except the ones under the libc and musl directories, are compiled with -fvisibility=hidden, and only if that build flag is enabled.
On the other hand, the symbols to be exposed as public, like the glibc ones, are annotated with OSV_*_API macros, which basically translate to the default-visibility attribute, and the standard C++ library is no longer linked with the whole-archive flag. Those OSV_*_API macros are things like OSV_LIBC_API, OSV_PTHREAD_API and so on, matching the roughly ten libraries that the OSv dynamic linker advertises.

Finally, the list of public symbols exported by the kernel is enforced during the build process, based on symbol list files for each advertised library (for example libc.so.6) maintained under the exported_symbols directory. These files are basically lists of symbols that are concatenated by a script called generate_version_script into a version script file, which is then fed to the linker as the version-script argument.

Now, in order to remove all the unneeded code, basically garbage, all files are compiled with -ffunction-sections and -fdata-sections and then linked with the --gc-sections flag. Any code that needs to stay, for example the bootstrap entry point, or dynamically enabled code like the optimal memcpy implementation or the trace point patch sites, is retained by putting relevant KEEP directives and relevant sections in the linker script.

The kernel ELF file built with most symbols hidden is roughly 4.3 megabytes in size, compared to 6.7, which is a reduction of around 40%. This great reduction stems from the fact that the standard C++ library is no longer linked with whole-archive, the symbol table is way smaller, and unused code is garbage-collected. Please note that the resulting kernel is still universal, as it exports all glibc symbols and includes all the device drivers. And as a result of this size reduction, the kernel also boots a little bit faster.

Well, this all sounds great, so one may ask: why not hide most symbols and the standard C++ library by default? The problem is that there are around 35 unit tests, and also some applications written in the past, that rely on internal C++ symbols, and they simply would not run if we hid all of them. They were written that way sometimes out of convenience, sometimes out of necessity. So to address this specific problem we need to expose some of those OSv C++ symbols as an API expressed in C, basically by defining very simple C wrapper functions that call into the C++ code.
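To make the visibility scheme and the C wrapper idea concrete, here is a minimal sketch, not the exact OSv code: the macro name follows the OSV_*_API pattern mentioned above, osv_get_all_threads (which comes up again later) is shown with an assumed signature, and the internal C++ API is a stand-in.

    #include <functional>

    // Everything is compiled with -fvisibility=hidden; only symbols annotated
    // like this keep default visibility and remain callable by applications.
    #define OSV_LIBC_API __attribute__((__visibility__("default")))

    // Stand-in for a hidden internal C++ API (in OSv this would be something
    // like sched::with_all_threads); it is not exported.
    namespace sched {
    struct thread {
        unsigned long tid;
        unsigned long id() const { return tid; }
    };
    void with_all_threads(std::function<void(thread&)> f); // internal, hidden
    }

    // Public C-style wrapper: a plain C symbol with default visibility that
    // calls into the C++ code, so modules no longer need internal C++ symbols.
    extern "C" OSV_LIBC_API
    void osv_get_all_threads(void (*cb)(void* arg, unsigned long tid), void* arg)
    {
        sched::with_all_threads([&](sched::thread& t) { cb(arg, t.id()); });
    }

The same pattern, hidden C++ internals behind a small exported C surface, is what the modules discussed a bit later were converted to.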
A good example of the modularity improvements made in release 0.57 is extracting the ZFS code out of the kernel as a dynamically linked library, libsolaris.so, which is effectively a new module. To accomplish that, we changed the main OSv makefile to build a new artifact, libsolaris.so, out of the ZFS and Solaris file sets in the makefile, which used to be linked into the kernel. The new library has to be linked with the bind-now flag and marked with an OSv-specific mlock note, to force the OSv dynamic linker to resolve symbols eagerly and to populate the mappings eagerly as well. This is done to prevent page faults that could lead to potential deadlocks as the library is loaded and initialized.

The init function, zfs_initialize, called upon the library's load, creates the necessary thread pools and registers various callbacks, so that the page cache, the ARC (ZFS's adaptive replacement cache) and the ZFS dev driver can interact with the relevant code in the ZFS library. On the other hand, the OSv kernel needs to expose around 100 symbols that provide some internal, FreeBSD-originating functionality that libsolaris.so depends on; OSv borrowed some code from FreeBSD, and a good chunk of that code was actually the implementation of ZFS, which is now outside of the kernel. Finally, the virtual file system bootstrap code needs to dynamically load libsolaris.so from bootfs or ROFS (the read-only file system) using dlopen before mounting the ZFS file system.

There are at least three advantages of moving ZFS into a separate library. First off, ZFS can be optionally loaded from another file system, like a bootfs or ROFS partition on the same disk or on another disk; I will discuss that in more detail in one of the upcoming slides. Second, the kernel gets smaller by around 800 kilobytes and effectively becomes 3.6 megabytes in size. Finally, there are at least 10 fewer threads needed to run a non-ZFS image; for example, when you run a ROFS image on OSv with one CPU, it only requires 25 threads.

Regular Linux glibc apps should run fine on a kernel with most symbols and the standard C++ library hidden, but unfortunately many of the unit tests I mentioned, and various internal OSv apps (so-called modules) written mostly in C++, do not, as they had been coded in the past to use those internal C++ symbols from the kernel, and we had to do something to deal with that problem.
So in release 0.57 we introduced a small C wrapper API, functions with a C-style calling convention, and we changed those modules to use these C wrapper functions instead of the C++ code. The benefit is that down the road newer apps or newer modules may use those C wrapper functions as well, and it also may make OSv more modular. One example is osv_get_all_threads, a function that gives a caller a thread-safe way to iterate over threads, and which, for example, is used in the HTTP monitoring module to list all the threads.

A good example of an OSv-specific module that used internal C++ symbols is the HTTP monitoring server. We modified the HTTP monitoring module to stop using the internal kernel C++ API. We did it by replacing some of the calls to internal C++ symbols with this new C-style module API from the previous slide; for example, sched::with_all_threads was replaced with the new osv_get_all_threads function. In other scenarios we fall back to the standard glibc API; for example, the monitoring app used to call osv::current_mounts and now it basically uses getmntent and related functions.

So release 0.57 introduced another build mechanism that allows creating a custom kernel with a specific list of drivers intended to target a given hypervisor. Obviously such a kernel benefits from an even smaller size and better security, as all unneeded code, meaning all unneeded drivers, is excluded during the build process. In essence, we introduced a new build script and makefile parameter, drivers_profile. This new parameter is intended to specify a driver profile, which is simply a list of device drivers to be linked into the kernel, plus some extra functionality these drivers depend on, like PCI or ACPI. Each profile is specified in a tiny include file with the .mk extension under the conf/profiles/<arch> directory and included by the main makefile as requested by the drivers_profile parameter. The main makefile has a number of if-expressions that conditionally add a given driver object to the linked-object list depending on the value, 0 or 1, of the given conf_drivers parameter specified in that include file. The benefit of using driver profiles is most profound when they are combined with a kernel built with most symbols hidden, as I talked about on one of the previous slides.
It's also possible to enable or disable individual drivers on top of a profile, since a profile is basically a list of drivers expressed as a number of configuration parameters, so you can, for example, explicitly include a specific driver, which I'm actually going to show here.

One may ask: why not use something more standard like Kconfig, which is, for example, what Unikraft does? Well, OSv has its own specific build system and I didn't want to introduce yet another way of doing things, so the build script simply takes various parameters, for example to hide symbols, to specify a particular driver profile, or a list of other options.

As you can see, in the first example we build the default kernel with all symbols hidden, and the resulting kernel is around 3.6 megabytes. In the next example we build a kernel with the virtio-over-PCI profile, which is around 300 kilobytes smaller. In the third one we build a kernel intended, for example, for Firecracker, where we include only the virtio block device and networking drivers over the MMIO transport. Then, in the fourth one, just to see how large the drivers code in OSv is, we use the base driver profile, which has basically no drivers, and you can see that the drivers code is roughly 600 kilobytes in size. And in the last one there is an option where you use a driver profile and then explicitly say which specific drivers, or driver-related capabilities, you want to use; in this case ACPI, virtio-fs, virtio-net and the pvpanic device.

Actually, with the new release of OSv 0.57 we started publishing new variations of the OSv kernel that correspond to these, I thought, interesting build configurations I just mentioned. In this example the "loader hidden" artifacts are effectively the versions of the OSv kernel built with most symbols hidden; you can see them at the top, for both ARM and x86. Then, for example, the second, third and fourth artifacts are versions of the kernel built for the microvm profile, which is effectively what you would use to run OSv on Firecracker, and which only has virtio over the MMIO transport.

Now, release 0.57 introduced yet another build mechanism, one that allows creating a custom kernel that exports only the symbols required by a specific application.
Such a kernel again benefits from being a little bit smaller, and it offers better security, as in essence all code not needed by that specific application is removed. This new mechanism relies on two scripts that analyze the build manifest, detect the application ELF files, identify the symbols required from the OSv kernel, and finally produce the application-specific version script, app_version_script. The generate_app_version_script script iterates over the manifest files produced by list_manifest_files.py, identifies undefined symbols in the ELF files (using objdump) that are also exported by the OSv kernel, and finally generates the app version script. Please note that this functionality only works when you build the kernel with most symbols hidden.

What I think is worth noting about this approach is that you basically run the build script against a given application twice: the first time to identify all the symbols the application needs from the OSv kernel, and the second time to actually build the kernel for that specific app. In this example we generate a kernel specific to running a simple Go app on OSv, and when you build the kernel with only the roughly 30 symbols required by the Go example, the kernel is smaller by around half a megabyte, at around 3.2 megabytes.

This approach obviously has some limitations. Some applications use, for example, dlsym to dynamically resolve symbols, and those symbols would be missed by this technique; in that scenario, for now, you have to manually find them and add them to the app version script file. Also, a lot of glibc functionality is still referenced internally by OSv itself: linux.cc, where all the system calls are implemented, still references code in some parts of the libc implementation, so that obviously would not be removed either. We could think of some kind of build mechanism that would, for example, find all usages of the syscall instruction (or SVC on ARM) and analyze which of that code is actually needed.

In the future we may componentize other functional elements of the kernel; for example, the DHCP lookup code could be either loaded from a separate library or compiled out depending on some build option. To improve compatibility, we are also planning to add support for statically linked executables, which would require implementing at least the clone, brk and arch_prctl syscalls.
We may also introduce the ability to swap the built-in versions of glibc components with third-party ones; for example, the subset of libm provided by the OSv kernel could be hidden with the mechanism discussed, and a different implementation of that library could be used instead. Finally, we are considering expanding the standard procfs and sysfs, and the OSv-specific parts of sysfs, to better support statically linked executables but also to allow regular apps to interact with OSv. A good example could be implementing a netstat-like capability, an application that could better expose OSv's networking internals at runtime.

In the next part of the presentation I will discuss the other interesting enhancements introduced as part of the latest 0.57 release. More specifically, I will talk about the lazy stack, the new ways to build ZFS images, and finally the improvements to the AArch64 port.

The lazy stack, which by the way is an idea floated by Nadav Har'El, who may be listening to this presentation, effectively allows saving a substantial amount of memory if an application spawns many pthreads with large stacks, by letting a stack grow dynamically as needed instead of being pre-populated ahead of time, which is normally the case in OSv right now. On OSv today, all kernel threads and all application threads have stacks that are automatically pre-populated, which is obviously not very memory efficient.

Now, the crux of the solution is based on the observation that the OSv page fault handler requires that both interrupts and preemption be enabled when a fault is triggered. Therefore, if the stack is dynamically mapped, we need to make sure that a stack page fault never happens in those relatively few places where kernel code executes with either interrupts or preemption disabled. We satisfy this requirement by pre-faulting the stack, reading one byte one page down from the stack pointer, just before preemption or interrupts are disabled. A good example of such code is in the scheduler, when OSv is trying to figure out the next thread to switch to: that code runs with preemption and interrupts disabled, and we obviously would not want a page fault to happen at that moment. There are relatively few places where that happens, and the idea is to pre-fault the stack around them. To achieve that, we analyzed the OSv code to find all the places where irq_disable or preempt_disable is called, directly or sometimes indirectly, and pre-fault the stack there if necessary.
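The pre-fault itself is tiny. Here is a minimal sketch of the idea; the helper name and the preemptable/irq_enabled checks are approximations of what is described here, not the exact OSv code:

    // Touch one byte one page below the current stack pointer so that, if that
    // stack page is not mapped yet, the fault happens now, while interrupts
    // and preemption are still enabled. (The real code reads relative to the
    // stack pointer register, typically with a line of inline assembly.)
    inline void prefault_next_stack_page()
    {
        char marker;                           // lives on the current stack
        volatile char* below = &marker - 4096; // one page further down
        (void)*below;                          // may trigger a stack page fault
    }

    // Typical call-site pattern for the dynamic case (rule five below): only
    // pre-fault when both preemption and interrupts are currently enabled.
    //
    //   if (sched::preemptable() && arch::irq_enabled()) {
    //       prefault_next_stack_page();
    //   }
    //   irq_disable();   // or preempt_disable()
    //   ... code that must not take a stack page fault ...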
As we analyzed all the call sites, we followed basically five rules. Rule one: do nothing if the call site in question always executes on a kernel thread, because kernel threads have pre-populated stacks and there is no chance a page fault will happen. Rule two: do nothing if the call site executes on another type of pre-populated stack; good examples are the interrupt and exception stacks or the syscall stack, which are all pre-populated. Rule three: do nothing if the call site executes when we know that either interrupts or preemption are already disabled, because somebody earlier has most likely already pre-faulted the stack. Rule four: pre-fault unconditionally if we know that both preemption and interrupts are enabled at that call site. And rule five, otherwise: pre-fault the stack conditionally, by determining dynamically, using the preemptable and irq_enabled functions, whether it is needed. Now, if we only ever followed rule number five, which is what I tried to do in my very first attempt to implement the lazy stack, it would actually be pretty inefficient; I saw pretty significant degradation of, for example, context switching and other parts of OSv when I dynamically checked whether preemption and interrupts were disabled. So it was admittedly pretty painful to analyze the code this way, but I think it was worth it.

As you remember from the modularity slides, the ZFS file system has been extracted from the kernel as a separate shared library called libsolaris.so, which can be loaded from a different file system before the ZFS file system is mounted. This allows for three ways ZFS can be mounted by OSv. The first and original way assumes that ZFS is mounted at the root from the first partition of the first disk. The second one involves mounting ZFS from the second partition of the first disk, at an arbitrary non-root mount point, for example /data. Similarly, the third way involves mounting ZFS from the first partition of the second (or a later) disk, also at an arbitrary non-root mount point. Please note that the second and third options obviously assume that the root file system is non-ZFS; it could be, for example, ROFS or bootfs.

This slide shows you the build command and how OSv runs when we follow the original and default method of building and mounting ZFS. For those that have done it before, there is nothing really interesting here.
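Before looking at the two new disk layouts, here is a minimal sketch of what the loading step described earlier conceptually amounts to. The library path matches the /usr/lib/fs location mentioned in a moment; calling zfs_initialize explicitly through dlsym is shown only for clarity, and its exact signature (or whether it instead runs as part of library initialization) is an assumption:

    #include <dlfcn.h>
    #include <cstdio>

    // Conceptual sketch: load the ZFS module from the (non-ZFS) root file
    // system before any ZFS mount is attempted. RTLD_NOW mirrors the eager,
    // bind-now linking described earlier.
    static bool load_zfs_module()
    {
        void* handle = dlopen("/usr/lib/fs/libsolaris.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "failed to load libsolaris.so: %s\n", dlerror());
            return false;
        }
        // zfs_initialize() sets up the thread pools and registers the page
        // cache/ARC and ZFS dev callbacks; after it runs, ZFS can be mounted.
        auto init = reinterpret_cast<void (*)()>(dlsym(handle, "zfs_initialize"));
        if (init) {
            init();
        }
        return true;
    }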
This is the first of the two new methods, where we allow ZFS to be mounted at a non-root mount point, like /data, and mixed with another file system on the same disk. Please note that libsolaris.so is placed on the root file system, typically ROFS, under /usr/lib/fs, and is loaded from it automatically. The build script automatically takes care of adding the relevant mount point for the ZFS file system.

The last method is basically similar to the previous one, but this time we allow ZFS to be mounted from a partition on the second disk (or another one). Interestingly, with this option I noticed that OSv would actually mount the ZFS file system around 30 to 40 milliseconds faster.

Now, there is another new feature. In order to build ZFS images and the file system, we used to use OSv itself to do it. With this new release there is a specialized version of the kernel, called zfs_loader, which basically delegates to utilities like zpool, zfs and so on to build and mount the OSv ZFS file system, and there is also a new script called zfs-image-on-host that can be used to mount OSv ZFS images on the host, provided you have OpenZFS functionality on your host system. That is actually quite nice, because you can mount an OSv disk and introspect it, you can also modify it using standard Linux tools, then unmount it and use it on OSv again. Here is some help output on how this script can be used.

Now, I don't think I have much time left, but I will try. I will focus a little bit on the AArch64 improvements, on three things that I think are worth mentioning: the changes to dynamically map the kernel during boot, moving it from the second gigabyte of virtual memory to the 63rd gigabyte; the enhancements to handle system calls; and handling exceptions on a dedicated stack.

As far as moving the kernel to the 63rd gigabyte of virtual memory: I'm not sure if you realize it, but the OSv kernel is actually position dependent, yet the kernel itself may obviously be loaded at different places in physical memory, so before this release you would have to build different versions for Firecracker and for QEMU. In this release we changed the logic in the assembly bootloader so that OSv detects where it is in physical memory and dynamically sets up the early mapping tables, to then eventually bootstrap to the right place in the position-dependent code. So now you can use the same version of the kernel on any hypervisor.
For system calls on ARM, we had to handle the SVC instruction; there is not really much interesting there if you know how that works. What is maybe a little bit more interesting is the change I made to make all exceptions, including system calls, work on a dedicated stack. Before that change, all exceptions would be handled on the same stack as the application, which caused all kinds of problems and, for example, would effectively prevent implementing the lazy stack. To support this, OSv, which runs in EL1, the kernel mode, takes advantage of the stack pointer selector register and uses both stack pointer registers, SP_EL0 and SP_EL1. Normally OSv uses the SP_EL1 register to point to the stack of each thread. With the new implementation, when an exception is taken we switch the stack pointer selector to SP_EL0, and once the exception has been handled we go back to normal, which is SP_EL1.

I think I am going to skip SeaweedFS because we have very little time left, but you can read about it on the slide. We have also added netlink support and made quite a few improvements to the VFS layer; both the netlink and the VFS improvements were done to support SeaweedFS, so those are basically more gaps that have been filled by trying to run this new use case.
So, just briefly, as we are pretty much at the end of the presentation: in the next releases of OSv, whenever they happen, I would like us to focus on supporting statically linked executables; on adding proper support for spinlocks (the OSv mutex, for example, is currently lockless, but under high contention it would actually make sense to use spinlocks, and we have a prototype of that on the mailing list); on supporting ASLR; on refreshing Capstan, which is a build tool that hasn't really been improved for a long time because we don't have volunteers; and even on the website. And there are many other interesting items.

As a last slide, I would like to use this occasion to thank the organizer, Razvan, for inviting me, and everybody else from the unikernel community. I also want to thank ScyllaDB for supporting me, and Dor Laor and Nadav Har'El for reviewing all the patches and for other improvements, and I want to thank all the other contributors to OSv. I would also like to invite you to join us, because there are not many of us, and if you want to keep OSv alive we definitely need you. There are some resources about OSv here, including my P99 CONF presentation, and if you have any questions I'm happy to answer them. Thank you.

Thank you, Waldek, thank you. So, any questions for Waldek? Yeah, please, Marta, go ahead; it's going to take a moment to get the mic to you.

Okay, I have two questions. First, when you spoke about the symbols, the glibc symbols and the C++ symbols: do I understand correctly that the problem is that the kernel might be using some glibc functions while the application might be linked against its own glibc, so the symbols clash?

Well, not really; they would use the same version, and there is no problem with, for example, malloc; it's not that we don't want to expose malloc. But a good chunk of OSv is implemented in C++, and all of those symbols don't need to be exposed: they inflate the symbol table a lot and they shouldn't really be visible to others. I think that if you build with that option, OSv now exposes around sixteen hundred symbols instead of seventeen thousand.

So it's really about the binary size there?

Yeah, basically the binary size and, in the case of the C++ library, avoiding a collision where you build OSv with a different version of the C++ library than the application.
Yeah, okay, so this is the case I'm interested in. Have you thought about maybe renaming the symbols in the kernel image at link time, maybe adding some prefix to all the symbols, so that you can have them visible but they would not clash?

That's an interesting idea; I haven't thought about it, yeah.

And, Marta, the second question?

Yeah, just a quick second question. When you spoke about the lazy stack, you said that you pre-fault the stack to avoid the problematic case where it faults with preemption disabled. So, when I think about it, you still need to have some kind of upper bound on the size of the stack, so that you know you pre-faulted enough to not get into that issue. So my question is: why not then have all the kernel stacks be of fixed size? Because if you already need some upper bound, why not have a global upper bound for the whole kernel? Wouldn't that be just easier?

Well, this is for application threads only, for application stacks; the kernel threads would still have the pre-populated, fixed-size stacks. There are many applications, a good example is Java, that would start, say, 200 threads, each with a one-megabyte stack that today gets pre-populated, and all of a sudden you need 200 megabytes. So this is just for application threads.

Okay, so basically my understanding is wrong; so the user stack and the kernel stack are the same stack?

Well, no, it's in the same virtual memory, but... I mean, when I say kernel stack: in OSv there are basically two types of threads, kernel threads and application threads, and application threads use their own stacks.

But when they enter the kernel, so to speak, they are still reusing the original stack, right?

I mean, application threads use the application stack and kernel threads use kernel stacks, and, since this is a unikernel, as application code executes it runs on the application stack, but it might execute some kernel code on that stack as well. Yeah.

Thank you. Any other questions? Okay, thank you so much, let's move on.