Perfect. So, welcome everybody. My name is Simon Künzer. I'm the project founder of Unikraft and lead maintainer there. I'm also CTO and co-founder of Unikraft GmbH, a commercial open-source company that uses the Unikraft project for building a new cloud. You saw some aspects of that in Rasmus's talk, about three talks before mine. Here I'll go into much more technical detail about how it looks on the kernel side, so this is much more of an OS class.

First of all, briefly: raise your hand if you know what a unikernel is. Good, okay, then I'll do this really quickly. In the end, what we are doing is turning an application into a kernel by using what we call operating system libraries that are baked directly into the application, and we run that as a virtual machine. So now we're all on the same page.

For the Unikraft project in particular, the important aspect is single purpose. We do specialization, meaning the OS layers are built just for the particular application we're interested in, and that allows a lot of optimizations. For instance, we use a single address space because we run one application in one kernel, and we have a really small TCB and a small memory footprint. Behind the scenes we have what we call the micro-library pool: decomposed OS functionality available as libraries, for instance scheduling or file systems, and then libraries that provide Linux or POSIX APIs to ease programming in that environment. So it's kind of an SDK.

The current project focus is Linux compatibility, because our vision is seamless application support for existing Linux applications. Since most software is written for Linux, especially for the cloud, we want to remove as many obstacles as we can for running it on top of Unikraft. One aspect, as Vassim was showing, is the tooling side. From the kernel side we have two approaches. We can do application compatibility natively, meaning you take the sources of your application and compile them together with the libraries of your kernel. But we can also do binary compatibility, which is what I'm going to talk about here, where we provide the Linux ABI to the application: we do system calls, the vDSO, and so on, to support a pre-compiled application. Now I'll go over the individual steps.

First of all, when you build a kernel and want it to run a Linux application, what the kernel needs to understand is the ELF format. Running an application ELF is actually a pretty straightforward process: you parse the ELF, load it into your memory space, then prepare the entry stack (this is all in specifications you can look up, with specific vectors and values to fill out), and then jump to the entry point (a sketch of this follows below). From that point on, the only interaction you will have with the application is system calls.
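To make the loading steps concrete, here is a heavily simplified sketch of preparing the entry stack. All names here are hypothetical, not Unikraft's actual internals; the layout itself (argc, argv[], envp[], then the auxiliary vector) is what the System V ABI specifies, and a real loader would also copy the strings, add entries such as AT_RANDOM, and align the stack pointer to 16 bytes.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>

struct elf_info {          /* hypothetical: filled in by the ELF parser/loader */
	uint64_t entry;    /* e_entry (relocated in the PIE case) */
	uint64_t phdr;     /* address of the program headers in memory */
	uint64_t phnum;    /* number of program headers */
};

/* Build the initial stack image downwards from 'top'; the returned
 * pointer is what goes into the stack register before jumping to the
 * ELF entry point. */
static uint64_t *prepare_entry_stack(uint64_t *top, uint64_t argc,
				     char **argv, char **envp,
				     const struct elf_info *ei)
{
	uint64_t *sp = top;
	uint64_t envc = 0;

	while (envp[envc])
		envc++;

	/* Auxiliary vector, terminated by AT_NULL */
	Elf64_auxv_t auxv[] = {
		{ AT_PHDR,   { ei->phdr  } },
		{ AT_PHNUM,  { ei->phnum } },
		{ AT_PAGESZ, { 4096      } },
		{ AT_ENTRY,  { ei->entry } },
		{ AT_NULL,   { 0         } },
	};
	sp -= sizeof(auxv) / sizeof(*sp);
	memcpy(sp, auxv, sizeof(auxv));

	/* envp[] and argv[], each NULL-terminated (the strings they
	 * point to are assumed to be placed already) */
	sp -= envc + 1;
	memcpy(sp, envp, (envc + 1) * sizeof(char *));
	sp -= argc + 1;
	memcpy(sp, argv, (argc + 1) * sizeof(char *));

	/* argc sits on top; this is where the entry code expects it */
	*--sp = argc;
	return sp;
}
```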
There are two interesting things here regarding loading ELF applications: there are non-PIE applications and PIE applications, that is, non-position-independent and position-independent executables. The non-position-independent ones dictate the address layout of your unikernel, and since we want to run everything in a single address space, that means we can support only one non-PIE application. The beauty of PIE is that we can relocate the application within the address space, so we can still work with a single address space but run multiple applications if we want. And if we go a bit further with the original reason why PIE is so common in operating systems nowadays, we could even apply full-stack ASLR within that single address space, with kernel libraries and application libraries spread completely across it. I'll give you some more reading on this when you download the slides, but basically the project focuses on PIE applications, since for at least the last five years that has been the standard for pre-compiled binaries you get from a distro.

System calls. We are in a single address space, we have a pre-compiled binary from Linux, and we want to run it, so we have to do system calls. This is actually a long pipeline that we need to run through (a sketch follows below). It starts with the syscall instruction; note that this part is x86-specific. That instruction normally takes care of a protection-domain switch from ring 3 to ring 0, but we run in a single protection domain, so we don't actually need that and go from ring 0 to ring 0; still, we need to execute the instruction. Then Linux requires us to have our own kernel-side stack: language environments like Go in particular don't give you a stack big enough to just continue executing on the kernel side, so you need to switch. In reality, if your application does have a big enough stack, you can configure Unikraft to get rid of this step. Then, at the moment, we use the full CPU feature set on the Unikraft side as well. That mainly comes from supporting native applications, where calling into the kernel is just a function call, so why would you restrict your CPU features? If you don't compile with that, you can remove this step too, but in the default configuration we need to save and restore the extended state of your CPU. Then we have a TLS switch, which we really do require, because we use the TLS for our OS decomposition, the librarization: we didn't want one central, big thread control structure where we'd have to maintain every particular field; we wanted everything cleanly decomposed, and TLS was the way to do that. And then, finally, we are able to call the system call handler.
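As a rough illustration of that path, here is what the C-level part of such a handler could look like. This is a minimal sketch under stated assumptions, not Unikraft's actual code: all names are made up, the syscall instruction is assumed to land in an assembly trampoline (not shown) that has already switched to the kernel-side stack, and the platform-specific steps are stubbed out.

```c
#include <errno.h>
#include <stdint.h>

struct syscall_regs {
	uint64_t nr;       /* syscall number (rax) */
	uint64_t args[6];  /* rdi, rsi, rdx, r10, r8, r9 */
};

/* Platform-specific steps, stubbed for the sketch: */
static void save_extended_state(void *buf)    { (void)buf; /* e.g. XSAVE */ }
static void restore_extended_state(void *buf) { (void)buf; /* e.g. XRSTOR */ }
static void switch_to_kernel_tls(void)        { /* e.g. write the FS base */ }
static void switch_to_app_tls(void)           { /* restore the app's FS base */ }

/* Dispatch into the actual per-syscall implementations (stubbed) */
static long do_syscall(uint64_t nr, const uint64_t *args)
{
	(void)nr; (void)args;
	return -ENOSYS;
}

long syscall_c_entry(struct syscall_regs *r)
{
	/* Buffer for the extended CPU state; the kernel side also uses
	 * FP/SIMD, so it must not clobber the application's state. */
	uint8_t xstate[4096] __attribute__((aligned(64)));
	long ret;

	/* Note what did NOT happen here: no ring 3 -> ring 0 transition,
	 * since everything runs in one protection domain. */
	save_extended_state(xstate);
	switch_to_kernel_tls();   /* decomposed per-library state lives in TLS */

	ret = do_syscall(r->nr, r->args);

	switch_to_app_tls();
	restore_extended_state(xstate);
	return ret;               /* the trampoline returns to the application */
}
```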
That brings you to the question: does it have to be that complicated? Do we really need that many steps, or is there something we can do a bit better with a single address space? And actually there is: the mechanism called the vDSO, and also the kernel vsyscall; these are old things. The vDSO, for us, is basically just a catalog for looking up kernel symbols. It's a way for us to provide function addresses to the application that the application can call directly, because again: single address space, single protection domain, we don't need to do a switch. The second thing is the __kernel_vsyscall symbol, which was a standardized symbol in the past, mostly on x86 when CPUs were introduced that did sysenter and syscall instead of interrupt-driven system calls; it was a way for the OS to tell the application how to enter the kernel more efficiently. For us, we can use it as a function call that jumps directly into the system call handler: no trap, no interrupt, no privilege-domain switch, and no need to save and restore the extended registers, because even that is covered by the System V calling convention. The idea here is that we just need to patch the libc shared library of the application; most system calls go through the libc wrappers anyway, so with that we have them covered. So in comparison: no expensive syscall instruction anymore, just a function call; the stack switch may still be needed, that depends; the floating-point registers you don't need to save anymore; and the TLS register we still need to switch.

Okay, now we get to what I call the fork dilemma; fork is always the bad word for unikernel people. A bit of OS class here: you probably remember that fork means duplicating the address space of your application so that everything stays at the same addresses. The problem with this is that it's a second address space, and I don't want a second address space; I want everything flat in a single address space to save time on context switches and TLB flushes. So that doesn't work. What we would actually need in a unikernel environment is a fork variant that forks the application to a different address. Unfortunately, without compiler support, that's not that easy. You could of course copy the memory region, but then you'd have to start patching your application, because you have absolute addresses in there, like return addresses on stacks, pointer values, etc. So that doesn't work that well.

But here is what we can do instead. When you look at the applications, they're compiled position-independent, so in principle we can already run multiple applications in a single address space. It's just that the fork-exec model doesn't fit, because of that intermediate fork step. And luckily, coming from older generations of Linux, when people were trying to run Linux on targets without an MMU, there is a system call named vfork that allows you to do this. It's a bit of a funky fork call, because it doesn't actually fork: it just pauses your parent and lets the child execute within the parent's memory space for a short period of time, until the child calls either exit or one of the exec functions to load a new application binary. And that is basically our solution for running multiple applications, like a shell: the shell forks, you load a position-independent binary in the next step anyway, then you have it in a different address range, and you execute it (see the sketch below). What we want to try out here is whether that mechanism just works if we internally replace the fork system call number with vfork, and see for how many applications that works, because they were doing fork-exec anyway, and for how many it doesn't. Obviously, it won't work for applications that use fork just to spawn worker processes; but for the fork-exec model it works.
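To make the vfork-plus-exec pattern concrete, here is a plain POSIX example that runs as-is on Linux. This is just the application-side pattern described above, not Unikraft code: the parent is paused while the child borrows its address space, and the only safe things for the child to do are to exec or to _exit.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = vfork();

	if (pid < 0) {
		perror("vfork");
		return 1;
	}
	if (pid == 0) {
		/* Child: running inside the parent's memory space. Load a
		 * new (position-independent) binary right away -- exactly
		 * the fork-exec model discussed above. */
		execlp("echo", "echo", "hello from the child", (char *)NULL);
		_exit(127);  /* only reached if exec failed */
	}

	/* Parent: resumes only after the child exec'd or exited */
	waitpid(pid, NULL, 0);
	return 0;
}
```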
Okay, and then the last point I have to rush a bit; when I was preparing, I kept checking: okay, I can't say this, there's no time, no time. Here I wanted to give you an overview of what we did over the last year when we looked at different applications, where we were always facing the same question: we need Linux compatibility, but at the same time we don't want to re-implement Linux. And there are some aspects where that's really tricky.

The first one is a really interesting example. It's about querying network interfaces from the kernel side, or setting up routing tables, etc. So the getifaddrs() call, which normally needs a complete socket subsystem to be implemented, and on top of that a protocol called netlink, just to do this user-space/kernel interaction for gathering that information (an illustration of the call follows at the end of this section). We did implement that, but an alternative here could also be to make use of the vDSO again: patch the libc and do more direct calls for just querying the currently configured IP address from the kernel.

Another interesting point is applications that are so mean that they implicitly rely on Linux-specific behavior. A funny example you come across from time to time is preemptive scheduling. So far we have cooperative scheduling in Unikraft, because for us that's a really efficient way to schedule. But then you run something like FrankenPHP or MySQL, and they use busy-waiting loops to wait for other threads to wake up. With a cooperative scheduler, you can imagine what happens: basically nothing, because you're constantly busy-waiting and never give another thread a chance to run.

Then there is the whole topic of which system calls you need to support: which ones you can stub, and which ones need to be completely implemented. It's actually true that you can stub a lot; you don't need all of them (see the sketch at the end of this section). I'd refer you to a nice ASPLOS paper; the authors actually gave a presentation about it here at FOSDEM last year, on how to figure out which system calls you need to implement. Of course it's sometimes application-dependent, but you do not need to implement all of the Linux system calls to have Linux compatibility. There are really a lot of system calls that are specific to narrow use cases, setting up cgroups or whatever, which normal applications don't use.

And then, of course, there's the whole topic of the filesystem hierarchy standard, where applications expect to find something under /proc or /etc or elsewhere. So far, we were able to get around that by providing a meaningfully filled text file to the application, especially for the proc filesystem, without actually implementing it yet. And that worked for NGINX, Node.js, Redis, HAProxy and a number of other applications.
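For reference, this is the application-side call from the netlink example above; it's plain POSIX/glibc and runs as-is on Linux, where its implementation inside libc gathers the interface list by talking to the kernel over a netlink socket — exactly the machinery a compatibility layer otherwise has to provide.

```c
#include <ifaddrs.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	struct ifaddrs *ifas, *ifa;
	char host[NI_MAXHOST];
	socklen_t len;

	if (getifaddrs(&ifas) == -1)   /* netlink under the hood on Linux */
		return 1;

	for (ifa = ifas; ifa; ifa = ifa->ifa_next) {
		if (!ifa->ifa_addr)
			continue;
		if (ifa->ifa_addr->sa_family == AF_INET)
			len = sizeof(struct sockaddr_in);
		else if (ifa->ifa_addr->sa_family == AF_INET6)
			len = sizeof(struct sockaddr_in6);
		else
			continue;
		if (getnameinfo(ifa->ifa_addr, len, host, sizeof(host),
				NULL, 0, NI_NUMERICHOST) == 0)
			printf("%-8s %s\n", ifa->ifa_name, host);
	}
	freeifaddrs(ifas);
	return 0;
}
```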
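And on the stubbing point, here is a minimal sketch of the idea, with made-up names rather than Unikraft's actual dispatch code: advisory calls like madvise can harmlessly claim success, while anything genuinely unsupported honestly returns -ENOSYS, for which most applications have a fallback path.

```c
#include <errno.h>
#include <stdint.h>

/* x86_64 syscall numbers, shown here just for illustration */
#define SYS_madvise  28
#define SYS_mlock   149

static long sys_stub_success(void) { return 0; }       /* pretend it worked */
static long sys_stub_enosys(void)  { return -ENOSYS; } /* "not implemented" */

/* Hypothetical fallback for syscall numbers without a real
 * implementation; fully implemented calls are dispatched elsewhere
 * and never reach this function. */
static long dispatch_unimplemented(uint64_t nr)
{
	switch (nr) {
	case SYS_madvise:  /* advisory only: claiming success is harmless */
	case SYS_mlock:    /* nothing is swapped out in a unikernel anyway */
		return sys_stub_success();
	default:
		return sys_stub_enosys();
	}
}
```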
Okay, we went through this really fast, I'm sorry. So we have time for some questions; I guess some things could use more clarity. We are an open-source project, and these are the resources where you can find us. You can also see that I put the KraftCloud logo here: this is what we are currently trying to build with our company, a cloud that uses the beauty of unikernels for really fast boot-ups and high resource efficiency, for serverless architectures, microservices, functions, etc. Unfortunately, there are just two minutes for questions, but still: are there any questions?

When you run everything in a single address space, do you actually need to enable paging at all?

So yeah, with the CPU we actually do need to enable paging, but it allows you to build the page table at compile time, and then it's just a matter of switching that page table on during boot. What additionally happens with Linux applications is that they sometimes do mmap calls, mapping something here or a file there; if you enable that support, then you need some kind of dynamic page-table handling. But if you don't need that, you have the opportunity to have a compile-time page table.

We don't have the time to discuss it, but I was wondering: if you have paging, wouldn't you be able to use copy-on-write for the fork problem? Maybe something to think about.

Of course, we do copy-on-write too, where you need it. The thing is, what we still want is a single address space. So there is basically just one page table, not another page table per application. We don't have these page-table switches, no TLB flushes; this is where we actually gain a lot of performance. And since we say we are single-purpose, we run only one thing, why would I need to handle more? Everything that runs inside the unikernel is defined to be trusted, and you have the hard isolation boundary outside, from the hypervisor environment, to protect against anything going bad or a taken-over unikernel.

If you write-protect the data pages of a process that does fork, you can actually detect processes that don't do fork-exec but a real fork to share memory, so you would be able to detect that. And I would just add that you would have multiple address spaces only for a short while, so it's not really a performance issue, right?

Yeah, it's two things, implementation effort and... but yes, I see your point.

Also, for non-position-independent applications: if the choice is between not supporting multiple of them and having multiple address spaces, why not go for multiple address spaces? It doesn't invalidate the unikernel idea.

No, no, it doesn't. It just comes with some cost, right?

Right.

Okay, thank you very much. We have to switch to the next talk. Thanks again.