All right, so we move on to our next talk. We have Udo here with the NOVA microhypervisor update. Udo, please.

Thank you, Arsalan. Good morning, everybody. Welcome to my talk at FOSDEM. It's good to be back here after three years. The last time I presented at FOSDEM, I gave a talk about the NOVA microhypervisor on ARMv8, and this talk will cover the things that have happened in the NOVA ecosystem since then.

So just a brief overview of the agenda. For all of those who might not be familiar with NOVA, I'll give a very brief architecture overview and explain the NOVA building blocks. Then we'll look at all the recent innovations that happened in NOVA in the last three years. I'll talk a bit about the code unification between ARM and x86, the two architectures that we support at this point. Then I'll spend the majority of the talk going into the details of all the advanced security features, particularly on x86, that we added to NOVA recently. Towards the end, I'll talk a little bit about performance, and hopefully we'll have some time for questions.

So the architecture of NOVA is similar to the microkernel-based systems that you've seen before. At the bottom, we have a kernel, which is not just a microkernel; it's actually a microhypervisor, called the NOVA microhypervisor. On top of it, we have this component-based multi-server user-mode environment. Genode would be one instantiation of it, and Martin has explained that most microkernel-based systems have this structure.

In our case, the host OS consists of all these colorful boxes. We have a master controller, which is sort of the init process; it manages all the resources that the microhypervisor does not need for itself. We have a bunch of drivers: all the device drivers run in user mode, so they're deprivileged. We have a platform manager, which primarily deals with resource enumeration and power management. You can run arbitrary host applications, many of them. And there's a bunch of multiplexers, like a UART multiplexer, so that everybody can get a serial console through a single interface, or a network multiplexer, which acts as a sort of virtual switch.

Virtualization is provided by virtual machine monitors, which are also user-mode applications. And we have this special design principle that every virtual machine uses its own instance of a virtual machine monitor. They don't all have to be the same.
For example, if you run a unikernel in a VM, as shown on the far right, the virtual machine monitor can be much smaller, because it doesn't need to deal with all the complexity that you would find in an OS like Linux or Windows.

So the entire host OS, consisting of the NOVA microhypervisor and the user-mode portion on top of it, is what Bedrock calls the Ultravisor, which is a product that we ship. And once you have a virtualization layer that is very small, very secure, and basically sits outside the guest operating system, you can build interesting features like virtual machine introspection or virtualization-assisted security, which uses features like nested paging, breakpoints, and patched civil overrides to harden the security of the guest operating systems: protecting critical data structures, introspecting memory, and also features in the virtual switch for doing access control between the different virtual machines and the outside world, as to who can send what types of packets. All of that is another product, which is called Ultra Security.

The whole stack, not just the kernel, the whole stack, is undergoing rigorous formal verification. And one of the properties that this formal verification effort is proving is what we call the bare-metal property. The bare-metal property basically says that combining all these virtual machines on a single hypervisor has the same behavior as if you were running them as separate physical machines connected by a real Ethernet switch, so that whatever happens in a virtual machine could have happened on a real physical machine that was not virtualized. That's what the bare-metal property says.

So the building blocks of NOVA are those that you would find in an ordinary microkernel: basically address spaces, threads, and IPC. NOVA address spaces are called protection domains, or PDs, and threads or virtual CPUs are called execution contexts, ECs for short. For those of you who don't know NOVA very well, I've included a very brief introductory slide on how all these mechanisms interact.

So let's say you have two protection domains, PD A and PD B. Each of them has one or more threads inside. And obviously, at some point, you want to intentionally cross these protection domain boundaries, because these components somehow need to communicate. That's what IPC is for. So assume that this client thread wants to send a message to the server thread.
It has a UTCB, a thread control block, which is like a message box. It puts the message in there and invokes an IPC call to the hypervisor. The call vectors through a portal, which routes that IPC to the server protection domain, and then the server receives the message in its UTCB. As part of this control and data transfer, the scheduling context, which is a time slice coupled with a priority, is donated to the other side. As you can see on the right, that's the situation after the IPC call has gone through: the server is now executing on the scheduling context of the client. The server computes a reply, puts it in its UTCB, issues a hypercall called IPC reply, and the reply goes back to the client, the scheduling context donation is reverted, and the client gets its time slice back. So what you get with that is very fast synchronous IPC, this time donation, and priority inheritance. And it's very fast because there's no scheduling decision on that path.

Also, NOVA is a capability-based microkernel, or hypervisor, which means all operations that user components perform with the kernel take capabilities as parameters. Capabilities have the nice property that they both name a resource and, at the same time, convey what access you have to that resource. So it's a very powerful access control primitive.

That said, let's look at all the things that happened in NOVA over the last two and a half or so years. We are now on a release cadence where we put out a new release of NOVA approximately every two months. Releases are named after the year and the week of the year, and this slide shows what we added to NOVA in '21 and '22, and what we'll add to the first release of this year at the end of this month.

We started out at the beginning of '21 by unifying the code base between x86 and ARM, making the load address flexible, and adding power management like suspend/resume, then extended that support to ARM. Later, in '22, when that unification was complete, we started adding a lot of, let's say, advanced security features on x86, like control flow enforcement, code patching, cache allocation technology, multiple spaces, and multi-key total memory encryption. And recently, we've added some APIC virtualization.

The difference between the things listed in bold here and those that are not: everything in bold I'll try to cover in this talk, which is a lot, so hopefully we'll have enough time to go through all of it.

First of all, the design goals that we have in NOVA.
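Before the design goals, here is a minimal sketch of the call/reply flow just described, seen from user level. The wrapper names (nova_ipc_call, nova_ipc_reply, utcb) and the payload layout are hypothetical illustrations, not NOVA's actual system-call bindings; the point is only the shape of the protocol: marshal into the UTCB, call through a portal, reply from the handler.

```cpp
#include <cstdint>

// Hypothetical user-level bindings; the real NOVA hypercall stubs differ.
using cap_sel = unsigned long;                     // capability selector
extern "C" void      nova_ipc_call  (cap_sel portal);  // send via portal, wait for reply
extern "C" void      nova_ipc_reply ();                // reply, wait for next request
extern "C" uint64_t *utcb();                           // per-EC message box (UTCB)

// Client side: put the request into the UTCB and call through the portal.
// The scheduling context (time slice + priority) is donated to the server.
uint64_t client_request (cap_sel portal, uint64_t arg)
{
    utcb()[0] = arg;            // marshal the request
    nova_ipc_call (portal);     // control transfers to the server EC
    return utcb()[0];           // the reply has been copied back into our UTCB
}

// Server side: the portal is configured to enter this handler for each call.
// nova_ipc_reply() does not return; it sends the reply, reverts the scheduling
// context donation, and waits for the next request, re-entering the handler.
[[noreturn]] void portal_handler()
{
    uint64_t req = utcb()[0];   // request delivered into our UTCB
    utcb()[0] = req + 1;        // compute some reply
    nova_ipc_reply();
    __builtin_unreachable();
}
```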
And Martin already mentioned that not all microkernels have the same design goals. Our design goal is to provide the same, or at least similar, functionality across all architectures, which means the API is designed in such a way that it abstracts from architectural differences as much as possible. You get a uniform experience whether you're on x86 or ARM: you can create a thread and you don't have to worry about details of the instruction set, the register set, or the page table format; NOVA tries to abstract all of that away.

We want a really simple build infrastructure. You'll see in a moment what the directory layout looks like, but suffice it to say that you can build NOVA with a very simple make command, where you say make ARCH equals x86 or ARM, and in some cases BOARD equals, I don't know, Raspberry Pi or NXP i.MX8, whatever, and it runs for maybe five seconds and then you get a binary.

We use standardized processes, like a standardized boot process and standardized resource enumeration, as much as possible, because that allows for great reuse of code. So we use Multiboot version 2 or 1, and UEFI, for booting, and we use ACPI for resource enumeration. You can also use the FDT, but that's more of a fallback. And on ARM, there's this interface called PSCI, for power state coordination, which also abstracts this functionality across many different ARM boards. So we try to use these interfaces as much as possible.

The code is designed in such a way that it is formally verifiable, and in our particular case that means formally verifying highly concurrent C++ code, not C code, not assembler code, but C++ code, and even weakly ordered memory, because ARMv8 has a weak memory model.

And obviously, we want NOVA to be modern, small, and fast, with best-in-class security and performance, and we'll see how we did on that.

So first, let me talk about the code structure. Martin mentioned in his talk this morning that using directories to your advantage can really help. On the right, you see the directory structure that we have in the unified NOVA code base. We have a generic inc directory and a generic src directory; those are the ones listed in green. Then we have architecture-specific subdirectories for aarch64 and x86_64, and we have architecture-specific build directories. There's also a doc directory, in which you will find the NOVA interface specification, and there's a single, unified Makefile.
When we looked at the source code and discussed it with our formal methods engineers, we recognized that basically all the functions can be categorized into three different buckets.

The first one is what we call same API and same implementation. This is totally generic code. All the system calls are totally generic code, all the memory allocators are totally generic code, and, surprisingly, even page tables can be totally generic code. These can all share the source files, the header files, and the spec files, which basically describe the interface pre- and postconditions.

The second bucket is functions that have the same API but maybe a different implementation. An example of that would be a timer, where the API could be: set a deadline for when a timer interrupt should fire. The API for all callers is the same, so you can potentially share the header or the spec file, but the implementation is very likely different on each architecture.

The final bucket is those functions that have a different API and a different implementation, where you can't share anything.

So the code structure is such that architecture-specific code lives in the architecture-specific subdirectories, and generic code lives in the parent directories above them. Whenever an architecture-specific file has the same name as a generic file, the architecture-specific file takes precedence and basically overrides, or shadows, the generic file. That makes it very easy to move files from architecture-specific to generic and back.

So the unified code base that we ended up with, and these are the numbers from the upcoming release 23.08, which will come out at the end of this month, shows what we ended up with in terms of architecture-specific versus generic code. In the middle, the green part is the generic code that's shared between all architectures, and it's 4,300 lines today. x86 adds 7,000 and some lines of specific code, and ARM, to the right, adds some 5,600 lines. So if you sum that up, for x86 it's roughly 11,500 lines, and for ARM it's less than 10,000 lines of code. It's very small, and ballpark 40% of the code for each architecture is generic and shareable. That's really great, not just from a maintainability perspective, but also from a verifiability perspective, because you have to specify and verify those generic portions only once.
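As an illustration of the "same API, different implementation" bucket, here is a small sketch of what such a split could look like. The Timer class, its methods, and the file names are hypothetical, not NOVA's actual timer interface; the generic header (and spec file) would be shared, while each architecture supplies its own implementation.

```cpp
// timer.hpp: generic header, shared by all architectures (same API).
#include <cstdint>

class Timer
{
    public:
        // Program the next timer interrupt to fire at an absolute deadline
        // (in timer ticks). Callers on every architecture use this signature.
        static void set_deadline (uint64_t ticks);

        // Current time in timer ticks.
        static uint64_t time();
};

// aarch64/timer.cpp: architecture-specific implementation using the ARM
// generic timer (CNTPCT / CNTP_CVAL system registers).
void Timer::set_deadline (uint64_t ticks)
{
    asm volatile ("msr cntp_cval_el0, %0; isb" : : "r" (ticks) : "memory");
}

uint64_t Timer::time()
{
    uint64_t t;
    asm volatile ("isb; mrs %0, cntpct_el0" : "=r" (t));
    return t;
}

// x86_64/timer.cpp would implement the same two functions differently, for
// example by programming the local APIC timer in TSC-deadline mode via the
// IA32_TSC_DEADLINE MSR.
```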
If you compile this unified code base into binaries, the resulting binaries are also very small, a little less than 70K in code size. Obviously, if you use a different compiler version or a different NOVA version, these numbers will differ slightly, but it gives you an idea of how small the code base and the binaries are.

So let's look at some interesting aspects of the architecture. Assume you've downloaded NOVA, you've built such a small binary from source code, and now you want to boot it. A typical boot procedure, both on x86 and ARM, which are converging towards using UEFI as firmware, will basically have this structure: the UEFI firmware runs first and then invokes some bootloader, passing some information like an image handle and a system table, and then the bootloader runs and invokes the NOVA microhypervisor, passing along the image handle and the system table, and maybe adding Multiboot information.

At some point there has to be a platform handover of all the hardware from the firmware to the operating system, in our case NOVA. This handover point is called exit boot services. It's basically the very last firmware function that you call, as either a bootloader or a kernel, and it's the point where the firmware stops accessing the hardware and ownership of the hardware transitions over to the kernel.

The unfortunate situation is that when you call exit boot services, firmware which may have enabled the IOMMU or SMMU at boot time to protect against DMA attacks drops that protection at this point, which sounds kind of silly, but that's what happens. The reason, if you ask those who are familiar with UEFI, is legacy OS support: UEFI assumes that maybe the next stage is a legacy OS which can't deal with DMA protection, so it gets turned off. That's really unfortunate, because between the point where you call exit boot services to take over the platform hardware and the point where NOVA can actually enable the IOMMU, there's a window of opportunity, shown in red here, where there are no DMA protections. It's very small, maybe a few nanoseconds or microseconds, but that's the point where an attacker could perform a DMA attack.

For that reason, NOVA takes complete control of the exit boot services flow. It's not the bootloader that calls exit boot services; NOVA actually drives the UEFI infrastructure, and it disables all bus-master activity before calling exit boot services, so that we eliminate this window of opportunity. That was a very aggressive change in NOVA, because it means NOVA has to comprehend UEFI.

The next thing that we added was a flexible load address.
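Before moving on to the load address, here is a sketch of the bus-master quiescing just described. It is an illustration, not NOVA's actual code: the real implementation walks ECAM/MMCONFIG discovered via ACPI rather than the legacy 0xCF8/0xCFC mechanism used here, and the UEFI handoff is only outlined in the trailing comment.

```cpp
#include <cstdint>

// x86 port I/O helpers.
static inline void     outl (uint16_t p, uint32_t v) { asm volatile ("outl %0, %1" : : "a" (v), "Nd" (p)); }
static inline uint32_t inl  (uint16_t p)             { uint32_t v; asm volatile ("inl %1, %0" : "=a" (v) : "Nd" (p)); return v; }
static inline void     outw (uint16_t p, uint16_t v) { asm volatile ("outw %0, %1" : : "a" (v), "Nd" (p)); }
static inline uint16_t inw  (uint16_t p)             { uint16_t v; asm volatile ("inw %1, %0" : "=a" (v) : "Nd" (p)); return v; }

// Legacy PCI configuration mechanism, good enough for a sketch.
static void cfg_select (unsigned bus, unsigned dev, unsigned fn, unsigned reg)
{
    outl (0xCF8, 0x80000000u | bus << 16 | dev << 11 | fn << 8 | (reg & 0xFC));
}

// Clear the Bus Master Enable bit (command register 0x04, bit 2) on every
// function, so no device can issue DMA while platform ownership changes hands.
void quiesce_bus_masters()
{
    for (unsigned bus = 0; bus < 256; bus++)
        for (unsigned dev = 0; dev < 32; dev++)
            for (unsigned fn = 0; fn < 8; fn++) {
                cfg_select (bus, dev, fn, 0x00);
                if ((inl (0xCFC) & 0xFFFF) == 0xFFFF)   // no device/function here
                    continue;
                cfg_select (bus, dev, fn, 0x04);
                uint16_t cmd = inw (0xCFC);
                outw (0xCFC, cmd & ~(1u << 2));          // drop bus mastering
            }
}

// In the UEFI handoff path the kernel would then do (types abbreviated):
//   quiesce_bus_masters();
//   bs->GetMemoryMap (&size, map, &key, &desc_size, &desc_ver);
//   bs->ExitBootServices (image_handle, key);
// and only re-enable bus mastering after the IOMMU/SMMU has been set up.
```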
So, on to the load address. When the bootloader wants to put a binary into physical memory, it invokes it with paging disabled, which means you have to load it at some physical address. You can define an arbitrary physical address, but it would be good if whatever physical address you define worked on all boards. And that is simply impossible, especially in the ARM ecosystem. On ARM, some platforms have DRAM starting at physical address zero, some have MMIO starting at address zero, so you will not find a single physical address range that works across all ARM platforms where you can say: always load NOVA at two megabytes, or one gigabyte, or whatever. So we made the load address flexible. Also, the bootloader might want to move NOVA to a dedicated place in memory, like at the very top, so that the bottom portion can be given one-to-one to a VM. So the load address is now flexible for NOVA. Not fully flexible, but you can move NOVA up and down by arbitrary multiples of two megabytes, so at superpage boundaries.

And the interesting insight is that pulling this off requires no ELF relocation complexity. NOVA consists of two sections: a very small init section, which is identity mapped, meaning virtual addresses equal physical addresses, and that's the code that initializes the platform up to the point where you can enable paging; and a runtime section, which runs paged, so it has virtual-to-physical memory mappings, and for those mappings, once you run with paging enabled, the physical addresses that back these virtual memory ranges simply don't matter. So paging is basically some form of relocation. You only need to deal with relocation for the init section, and you can solve that by making the init section position-independent code. It's assembler anyway, so making it position independent is not hard. We actually didn't make that code just position independent; it's also mode independent, which means no matter whether UEFI starts you in 32-bit mode or 64-bit mode, that code deals with all these situations.

There's an artificial limit in that you still have to load NOVA below four gigabytes, because Multiboot has been defined in such a way that you can't express addresses above four gigabytes; some of these structures are still 32-bit, and that little emoticon expresses what we think of that.

So then, after we had figured this out, we wanted to do some power management, and this is an overview of all the power management that ACPI defines. ACPI defines a few global states, like working, sleeping, and off.
Those aren't all that interesting; the really interesting states are the sleep states. The state with the black bold border around it is the state the system is in when it's fully up and running, no idling, no sleeping, nothing. That's called the S0 working state, and then there are some sleep states. You might know suspend to RAM, suspend to disk, and soft off. When you're in the S0 working state, you can have a bunch of idle states, and in the C0 idle state you can have a bunch of performance states, which roughly correspond to voltage and frequency scaling, so ramping the clock speed up and down.

Unfortunately we don't have a lot of time to go into all the details of these sleep states, but I still want to say a few words about this. We implemented suspend/resume on both x86 and ARM, and there are two ways you can go about it: one which I would call the brute-force approach, and the other which is the smart approach.

The brute-force approach basically goes like this: you look at all the devices that lose their state during a suspend/resume transition, and you save their entire register state. That's a significant amount of state to manage, and it may even be impossible to manage, because if you have devices with hidden internal state, you may not be able to get at it, or if a device has a hidden internal state machine, you may not know what the internal state of that device is at that point. So it may be suitable for some generic devices: if you wanted to save the configuration space of every PCI device, that's generic enough that you could do it. But for interrupt controllers or SMMUs with internal state, that's not smart.

For those you can use the second approach, which is what NOVA uses: you save a high-level configuration and you reinitialize the device based on that. As an example, say you had an interrupt routed to core zero in edge-triggered mode. You would save that as high-level information, and that's sufficient to reinitialize all the interrupt controllers, all the redirection entries, all the trigger modes, based on just this bit of information. So there's a lot less information to maintain, saving becomes basically a no-op, and restoring can actually use the same code path that you used to initially bring up that particular device. That's the approach for all the interrupt controllers, all the SMMUs, all the devices managed by NOVA.

The next thing I want to briefly talk about is P-states, performance states, which are these gears for ramping up the clock speed on x86, and NOVA can now deal with all these P-states.
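Before getting to P-states, here is a small sketch of the save-the-high-level-state approach just described, using interrupt routing as the example. The IrqConfig structure and the program_redirection_entry helper are hypothetical, not NOVA's actual types; the point is that suspend keeps only a tiny record per interrupt and resume simply replays the boot-time initialization path.

```cpp
#include <cstdint>

struct IrqConfig
{
    uint32_t gsi;            // global system interrupt number
    uint16_t cpu;            // destination core
    bool     level;          // level- vs edge-triggered
    bool     active_low;     // polarity
    bool     masked;
};

// One small record per interrupt is the only state kept across suspend.
static IrqConfig irq_config[256];

// The same function the boot path uses to program an I/O APIC / GIC
// redirection entry from a high-level description (declaration only here).
void program_redirection_entry (IrqConfig const &cfg);

// Suspend: nothing to save for interrupts; the high-level config is already
// up to date, because every assign/mask operation updated irq_config[].
void irq_suspend() {}

// Resume: the controller lost its register state, so replay the boot-time
// initialization from the high-level records.
void irq_resume()
{
    for (auto const &cfg : irq_config)
        program_redirection_entry (cfg);
}
```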
So, on to P-states. The interesting aspect is that most modern x86 processors have something called turbo mode, and turbo mode allows one or more cores to exceed the nominal clock speed, to actually turbo up higher, if other cores are idle. So if other cores are not using their thermal or power headroom, a select set of cores, maybe just one core, maybe a few, can turbo up by many bins, and this is shown here on active core zero, which basically gets the thermal headroom of core one, core two, and core three to clock up higher. NOVA will exploit that feature when it's available, but there are situations where you want predictable performance, where you want every core to run at its guaranteed frequency, and there's a command-line parameter that you can set that basically clamps the maximum speed to the guaranteed frequency. You could also lower the frequency to something less than the guaranteed frequency. There's an operating point called maximum efficiency, and there are even points below that where you can clock even lower, but then it's actually less efficient than that point. So all of that is also supported.

As an overview, from a feature comparison perspective, ARM versus x86: we support P-states on x86, not on ARM, because there's no generic interface on ARM yet. We support all the S-states on x86, like stop clock, suspend to RAM, hibernation, power off, and platform reset. On ARM there's no such concept as S-states, but we also support suspend/resume and suspend to disk, if it's supported. And what does "if it's supported" mean? It means if platform firmware, like PSCI, implements it; there are some features that are mandatory and some features that are optional. So suspend/resume, for example, works great on the NXP i.MX8M that Stefan had for his demo; it doesn't work so great on the Raspberry Pi, because the firmware simply has no support for jumping back to the operating system after a suspend. So that's not a NOVA limitation.

There's a new suspend feature called low-power idle, which we don't support yet, because it requires way more support than just NOVA: it basically requires powering down the GPU, powering down all the devices, powering down all the links, so it's a concerted platform effort.
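Coming back to the P-state clamping mentioned above, here is a sketch of how a hypervisor could cap cores at the guaranteed performance level using Intel's hardware P-states (HWP). The MSR numbers and field layouts are from the Intel SDM; the rdmsr/wrmsr helpers and the policy itself are illustrative, not NOVA's actual command-line handling.

```cpp
#include <cstdint>

constexpr uint32_t IA32_PM_ENABLE        = 0x770;   // bit 0: enable HWP
constexpr uint32_t IA32_HWP_CAPABILITIES = 0x771;   // per-core performance levels
constexpr uint32_t IA32_HWP_REQUEST      = 0x774;   // min/max/desired/EPP

static inline uint64_t rdmsr (uint32_t msr)
{
    uint32_t lo, hi;
    asm volatile ("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
    return uint64_t (hi) << 32 | lo;
}
static inline void wrmsr (uint32_t msr, uint64_t val)
{
    asm volatile ("wrmsr" : : "c" (msr), "a" (uint32_t (val)), "d" (uint32_t (val >> 32)));
}

// If predictable performance is requested (e.g. via a command-line switch),
// cap the maximum performance at the guaranteed level so no core depends on
// turbo headroom; otherwise allow the full range up to the highest level.
void configure_pstates (bool predictable)
{
    wrmsr (IA32_PM_ENABLE, 1);                        // hand P-state control to HWP

    uint64_t caps       = rdmsr (IA32_HWP_CAPABILITIES);
    uint8_t  highest    =  caps        & 0xff;        // includes turbo bins
    uint8_t  guaranteed = (caps >>  8) & 0xff;        // sustainable on all cores
    uint8_t  lowest     = (caps >> 24) & 0xff;

    uint8_t  max = predictable ? guaranteed : highest;

    // HWP_REQUEST: [7:0] min, [15:8] max, [23:16] desired (0 = autonomous),
    // [31:24] energy/performance preference (0x80 = balanced).
    wrmsr (IA32_HWP_REQUEST, uint64_t (0x80) << 24 | uint64_t (max) << 8 | lowest);
}
```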
But from a hypercall perspective, the hypercall that you would invoke to transition the platform to a sleep state is called control hardware, and whenever you invoke it with something that's not supported, it returns BAD_FEATURE. And for the hypercalls that assign devices or interrupts: the assignments that the system had, which devices and interrupts were assigned to which domains, are completely preserved across suspend/resume cycles, using this approach of saving the high-level state.

So next I'll talk about a radical API change that we made. Being a microkernel and not being Linux, we don't have to remain backward compatible. So this is one of those major API changes that took quite a lot of time to implement.

What we had in the past was basically an interface with five kernel objects: protection domains, execution contexts, scheduling contexts, portals, and semaphores. And every protection domain looked as shown on this slide. It actually had six resource spaces built into it: an object space, which holds capabilities to all the kernel objects that you have access to; a host space, which represents the stage-1 host page table; a guest space, which represents the stage-2 guest page table; a DMA space, for memory transactions that are remapped by the IOMMU; a port I/O space; and an MSR space. All of these existed as a single instance in every protection domain, and when you created a host EC, a guest EC (like a virtual CPU), or a device, they were automatically bound to the PD, picking up the spaces that they needed.

That worked great for us for more than 10 years, but it turned out to be suboptimal for some more advanced use cases, like nested virtualization. If you run a hypervisor inside a virtual machine and that hypervisor creates multiple guests itself, then you suddenly need more than one guest space: you need one guest space per sub-guest. So you need multiple of these yellow guest spaces. Or when you virtualize the SMMU, and the SMMU has multiple contexts and every context has its own page table, then you suddenly need more than one DMA space, so you need more of these blue boxes. And the same can be said for port I/O and MSR spaces. So how do we get more than one, if the protection domain has all of these as single instances?

So what we did, and it was quite a major API and internal reshuffling, is we separated these spaces from the protection domain. They are now new first-class objects. So NOVA just got six new kernel objects: when you create them, you get individual capabilities for them, and you can manage them independently from the protection domain.
So the way this works is: first you create a protection domain with create PD, then you create one or more of these spaces, again with create PD; that's a sub-function of create PD. And then you create an EC, like a host EC, and it binds to those spaces that are relevant for a host EC. So a host EC, like a host thread, needs capabilities, so it binds to an object space; it needs a stage-1 page table, so it binds to a host space; and it needs access to ports, so it binds to a port I/O space, on x86 only, because on ARM there's no such thing. So for a host thread, all these assignments are static. We could make them flexible, but we have not found a need.

It gets more interesting for a guest EC, which is a virtual CPU that runs in a guest. Again, the sequence is the same: you first create a protection domain, then you create one or more of these spaces, and when you create the virtual CPU, it binds only to those spaces that it immediately needs, which are the object space and the host space. It does not yet bind to any of the flexible spaces shown to the right. That binding is established on the startup IPC, during the IPC reply: you pass capability selectors for the spaces you want to attach to, and then the vCPU flexibly binds to those spaces, as denoted by these dashed lines. And that assignment can be changed on every event. So every time you take a VM exit, NOVA synthesizes an exception IPC, or architectural IPC, and sends it to the VMM for handling, and when the VMM replies, it can set a bit in the message transfer descriptor to say: I want to change the space assignment. It passes new selectors, and then you can flexibly switch between those spaces, and that allows us to implement, for example, nested virtualization.

The same goes for a device, which on x86 is represented by a bus/device/function and on ARM is represented by a stream ID. The assign-device hypercall can flexibly rebind the device to a DMA space at any time.

So that took quite a while to implement, but it gives us so much more flexibility, and I heard that some of the NOVA forks have come across the same problem, so maybe that's something that could work for you too.

So let's talk about page tables. I mentioned earlier that page tables are actually generic code, which is somewhat surprising. NOVA manages three page tables per architecture: the stage-1 or host page table, the stage-2 or guest page table, and a DMA page table, which is used by the IOMMU. These correspond to the three memory spaces that I showed on the previous slide.
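Before going into the page-table details, here is a sketch of the creation and binding sequence just described. The wrapper names (create_pd, create_space, create_vcpu) and the Space enumeration are hypothetical stand-ins for the actual hypercall encodings; the point is only that spaces are now first-class objects with their own capability selectors, and that a vCPU can be rebound to different guest spaces later, on an IPC reply.

```cpp
using cap_sel = unsigned long;

enum class Space { OBJ, HST, GST, DMA, PIO, MSR };

// Hypothetical user-level wrappers (declarations only).
cap_sel create_pd    (cap_sel parent);                       // new protection domain
cap_sel create_space (cap_sel pd, Space type);               // sub-function of create_pd
cap_sel create_vcpu  (cap_sel pd, cap_sel obj, cap_sel hst); // binds only OBJ + HST up front

void setup_nested_guest (cap_sel root)
{
    cap_sel pd  = create_pd (root);
    cap_sel obj = create_space (pd, Space::OBJ);
    cap_sel hst = create_space (pd, Space::HST);

    // One guest space per (sub-)guest; more can be created at any time.
    cap_sel gst_l1 = create_space (pd, Space::GST);
    cap_sel gst_l2 = create_space (pd, Space::GST);

    cap_sel vcpu = create_vcpu (pd, obj, hst);

    // The flexible spaces are attached on the startup IPC reply, and can be
    // switched again on any later VM-exit reply via a bit in the message
    // transfer descriptor, e.g. to run the nested guest on gst_l2.
    (void) vcpu; (void) gst_l1; (void) gst_l2;
}
```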
The way we made this page table code architecture independent is by using a templated base class, which is completely lockless, so it's very scalable. The reason it can be lockless is that the MMU doesn't honor any software locks anyway: if you put a lock around your page table infrastructure, the MMU wouldn't know anything about those locks, so the code has to be written in a way that it does atomic transformations anyway, so that the MMU never sees an inconsistent state. And once you have that, there's also no need to put a lock around it for software updates, so it's completely lock free.

That architecture-independent base class deals with all the complexities of allocating and deallocating page tables, splitting superpages into page tables, or overmapping page tables with superpages. You can derive architecture-specific subclasses from it, and the subclasses inject themselves as a template parameter into the base class; that's called the curiously recurring template pattern. The subclasses then do the transformation between the high-level attributes, like this page is readable, writable, user accessible, whatever, into the individual bits and encoding of the page table entries as that architecture needs them. There are also some coherency requirements on ARM, and some coherency requirements for SMMUs that don't snoop the caches, so these architecture-specific subclasses deal with all that complexity. But it allows us to share the page table class and to specify and verify it only once.

So let's look at page tables in a little bit more detail, because there's some interesting stuff you need to do on ARM. Most of you who've taken an OS class or written a microkernel will have come across this page table format, where an input address, like a host virtual or guest physical address, is split up into an offset portion into the final page, 12 bits, and then nine bits indexing into each of the individual levels of the page table. When an address is translated by the MMU from a virtual address into a physical address, the MMU first uses bits 30 to 38 to index into the level-2 page table to find the level-1 table, and then to find the level-0 table, and the walk can terminate early: you can have a leaf page at any level, which gives you one-gigabyte, two-megabyte, or 4K pages. With a page table structure like this, with three levels, you can create an address space of 512 gigabytes in size, and that should be good enough, but it turns out we came across several ARM platforms that have an address space size of one terabyte.
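Coming back to the generic page-table class for a moment, here is a minimal sketch of the curiously recurring template pattern as described above. The class names, the attribute bits, and the update logic are hypothetical, not NOVA's actual types; the point is that the generic walker is written (and specified) once, while each architecture supplies only the PTE encoding and any coherency maintenance.

```cpp
#include <cstdint>
#include <atomic>

enum Attr : unsigned { ATTR_R = 1, ATTR_W = 2, ATTR_X = 4, ATTR_U = 8 };

template <typename ARCH>
class Ptab
{
    public:
        // Generic part: walk/allocate levels, split or merge superpages, and
        // publish entries atomically so the MMU never sees a torn PTE.
        void update (uint64_t virt, uint64_t phys, unsigned attr)
        {
            std::atomic<uint64_t> *pte = walk (virt);            // generic walker
            pte->store (ARCH::encode (phys, attr), std::memory_order_release);
            static_cast<ARCH *>(this)->sync();                   // arch-specific coherency
        }

    private:
        std::atomic<uint64_t> *walk (uint64_t virt);             // allocation, splitting, ...
};

// The architecture-specific subclass injects itself into the base (CRTP) and
// provides the bit-level PTE encoding plus whatever cache maintenance is
// needed, e.g. for SMMUs that do not snoop the page-table walks.
class Ptab_arm final : public Ptab<Ptab_arm>
{
    public:
        static uint64_t encode (uint64_t phys, unsigned attr)
        {
            uint64_t pte = phys | 0x3;                           // valid + page/table bits
            if (!(attr & ATTR_W)) pte |= 1ull << 7;              // AP[2]: read-only
            if (!(attr & ATTR_X)) pte |= 1ull << 54;             // XN: not executable
            return pte;
        }
        void sync();                                             // DC CVAC / DSB as required
};
```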
So one terabyte is twice that: you need one extra bit, which you can't represent with 39 bits, so you have a 40-bit address space. What would you do if you were designing a chip? You would expect it to just open a new level here, so that you get a four-level page table. But ARM decided differently, because they said: if I just add one bit, the new top-level page table would have just two entries, and that's not worth building basically another level for. So what they did is they came up with a concept called concatenated page tables, which makes the level-2 page table twice as large by adding another bit at the top. So now suddenly the level-2 page table has 10 bits of indexing, and the backing page table has 1024 entries and is 8K in size. And this concept was extended: if you go to a 41-bit address space, again you get one additional bit and the page table gets larger, and this keeps going, up to four extra bits, where the level-2 page table is 64K in size. And there's no way around it; the only point at which you can actually open the level 3 is when you reach 44 bits. At 44 bits you can go to a four-level page table, and it looks like this.

So the functionality we also had to add to NOVA is to comprehend this concatenated page table format, so that we can deal with arbitrary address space sizes on ARM. And we actually had a device, I think it was a Xilinx ZCU102 or so, which had something mapped above 512 gigabytes and just below one terabyte, and you can't pass that through to a guest if you don't have concatenated page tables.

So the generic page table class we have right now is so flexible that it can basically do what's shown on this slide. The simple case is x86: you have three-level, four-level, or five-level page tables with a uniform structure of nine bits per level and 12 offset bits. The 39-bit, three-level form isn't used by the MMU but might be used by the IOMMU; the MMU typically uses four levels, and in high-end boxes like servers, five levels for 57 bits. On ARM, depending on what type of SoC you have, it has something between 32 and up to 52 physical address bits, and the table shows the page-table level split, the indexing split, that NOVA has to do, and all these colored boxes are basically instances of concatenated page tables. So 42 bits would require three bits to be concatenated, here we have four, here we have one, here we have two, so we really have to exercise all of those, and we support all of those.
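To make the arithmetic concrete, here is a small self-contained sketch that computes, for a given stage-2 input-address size and a 4K granule, how many levels are walked and how many index bits the top level resolves; anything above nine bits at the top means concatenated tables. It mirrors the scheme described above and is an illustration, not NOVA's code.

```cpp
#include <cstdio>

struct Layout {
    unsigned levels;      // number of translation levels walked
    unsigned top_bits;    // index bits resolved at the top level
};

// Smallest number of levels such that the top level needs at most
// 9 + 4 = 13 index bits (i.e. at most 16 concatenated 4K tables).
constexpr Layout stage2_layout (unsigned ipa_bits)
{
    constexpr unsigned page_bits  = 12;  // 4K pages
    constexpr unsigned level_bits = 9;   // 512 entries per 4K table
    constexpr unsigned max_top    = level_bits + 4;

    unsigned remaining = ipa_bits - page_bits;
    unsigned levels    = 1;
    while (remaining > (levels - 1) * level_bits + max_top)
        levels++;
    return { levels, remaining - (levels - 1) * level_bits };
}

int main()
{
    for (unsigned ipa : { 39u, 40u, 42u, 44u, 48u }) {
        auto [levels, top] = stage2_layout (ipa);
        // Top-level table: 2^top entries of 8 bytes each (so 40 bits -> 1024
        // entries / 8K, 43 bits -> 8192 entries / 64K, 44 bits -> 4 levels).
        std::printf ("IPA %2u bits: %u levels, top level resolves %2u bits (%u entries)\n",
                     ipa, levels, top, 1u << top);
    }
}
```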
And unlike in the past, where NOVA said a page table has so many levels of so many bits each, we have now turned this around: you say the page table covers so many bits, and we compute the number of bits per level and the concatenation at the top level automatically in the code. So that was another fairly invasive change.

While we were re-architecting all the page tables, we took advantage of a new feature that Intel added to Ice Lake servers and to Alder Lake desktop platforms, which is called total memory encryption with multiple keys. What Intel did there is repurpose certain bits of the physical address in the page table entry, the top bits, shown here as key ID bits. So it's stealing some bits from the physical address, and the key ID bits index into a key programming table, shown here, that basically selects a slot. Let's say you have four key ID bits; that gives you 16 keys, two to the power of four, so your key programming table would let you program 16 different keys. We've also come across platforms that have six bits; how many bits are stolen from the physical address is basically flexible and can vary per platform, depending on how many keys are supported. Those keys are used by a component called the memory encryption engine.

The memory encryption engine sits at the perimeter of the package, or the socket, basically at the boundary where data leaves the chip that you plug into the socket and enters the interconnect and enters RAM. So inside this green area, which is inside the SoC, everything is unencrypted, in the cores, in the caches, in the internal data structures, but as data leaves the die and moves out to the interconnect, it gets encrypted automatically by the memory encryption engine with the key. This example shows a separate key being used for each virtual machine, which is a typical use case, but it's actually much more flexible than that: you can select the key on a per-page basis. So you could even say, if there was a need for these two VMs to share some memory, that some blue pages would appear here and some yellow pages would appear there; that's possible.

So we added support in the page tables for encoding these key ID bits, we added support for using the PCONFIG instruction for programming keys into the memory encryption engine, and the keys can come in two forms: you can either randomly generate them, in which case NOVA will also drive the digital random number generator to generate entropy, or you can program tenant keys.
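Here is a small sketch of the key-ID encoding just described: the key ID is folded into the topmost implemented physical-address bits of a page table entry. The Mktme structure and the helper are illustrative, not NOVA's code; the number of key-ID bits and the physical-address width would be discovered at boot.

```cpp
#include <cstdint>

struct Mktme {
    unsigned keyid_bits;   // how many top physical-address bits are repurposed
    unsigned pa_bits;      // physical address width reported by CPUID
};

// Place the key ID in the topmost implemented physical-address bits,
// i.e. bits [pa_bits-1 : pa_bits-keyid_bits] of the PTE's address field.
constexpr uint64_t encode_keyid (Mktme const &m, uint64_t phys, unsigned keyid)
{
    unsigned shift = m.pa_bits - m.keyid_bits;
    return phys | uint64_t (keyid) << shift;
}

// Example: 46 implemented address bits, 4 of them key-ID bits, key slot 5.
// The page at 0x1234000 is then encrypted with the key programmed in slot 5.
static_assert (encode_keyid ({ 4, 46 }, 0x1234000, 5) ==
               (0x1234000ull | 5ull << 42));
```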
So for tenant keys, you can say: I want to use this particular AES key for encrypting the memory, and that's useful for things like VM migration, where you want to take an encrypted VM and move it from one machine to another.

The reason why Intel introduced this feature is confidential computing, but also because DRAM is slowly moving towards non-volatile RAM, and offline attacks, evil maid attacks or so, where somebody unplugs your RAM, or takes your non-volatile RAM and looks at it in another computer, are a big problem. They can still unplug your RAM, but they would only see ciphertext.

So that was more of a confidentiality improvement; the next thing we looked at is improving availability, and we added some support for dealing with noisy neighbor domains. So what are noisy neighbor domains? Let's say you have a quad-core system, as shown on this slide, and you have a bunch of virtual machines, as shown at the top. On some cores you may over-provision and run more than one VM, like on core zero and core one. For some use cases you might want to run a single VM on a core, like a real-time VM, which is exclusively assigned to core two. But then, on some cores, like the one shown on the far right, you may have a VM that's somewhat misbehaving, and somewhat misbehaving means it uses excessive amounts of memory and basically evicts everybody else out of the cache. If you look at the last-level cache portion here, the amount of cache used by the noisy VM is very disproportionate to the amount of cache given to the other VMs, simply because it is trampling all over memory. And this is very undesirable from a predictability perspective, especially if you have a VM like the green one, which is real time and may want to have most of its working set in the cache.

So is there something we can do about it? And yes, there is. It's called CAT. CAT is Intel's acronym for cache allocation technology, and what they added in the hardware is a concept called a class of service. You can think of a class of service as a number, and again, like the key IDs, there's a limited number of classes of service available, like four or sixteen, and you can assign this class-of-service number to each entity that shares the cache. So you could make it a property of a protection domain or a property of a thread. For each of the classes of service you can program a capacity bitmask, which says what proportion of the cache this class of service can use: can it use 20%, 50%, and even which portion?
There are some limitations, like the bitmask must be contiguous, but bitmasks can overlap for sharing. And there's a model-specific register, which is not cheap to program, where you say: this is the active class of service on this core right now. So this is something you have to context switch, to say I'm now using something else. When you use this, it improves predictability, like the worst-case execution time, quite nicely, and that's what it was originally designed for. But it turns out it also helps tremendously with cache side-channel attacks, because if you can partition your cache in such a way that your attacker doesn't allocate into the same ways as the VM you're trying to protect, then all the flush-and-reload attacks simply don't work.

So here's an example of how this works. To the right, I've shown an example with six classes of service and a cache which has 20 ways. You can program, and this is again just an example, the capacity bitmask for each class of service, for example to create full isolation: you could say class of service 0 gets 40% of the cache, ways 0 to 7, class of service 1 gets 20%, and everybody else gets 10%, and these capacity bitmasks don't overlap at all, which means you get zero interference through the level-3 cache. You could also program them to overlap.

There's another mode, called CDP, code and data prioritization, which splits the number of classes of service in half and basically redefines the meaning of the bitmasks to say: those with an even number are for data and those with an odd number are for code. So you can even discriminate between how the cache is used for code and for data, which gives you more fine-grained control. The NOVA API forces users to declare upfront whether they want to use CAT or CDP to partition their cache, and only after you've made that decision can you actually configure the capacity bitmasks.

With CDP it would look like this: you get three classes of service instead of six, distinguished between D and C, data and code, and you could, for example, say class of service 1, as shown on the right, gets 20% of the cache for data and 30% of the cache for code, so 50% of the capacity in total, exclusively assigned to anybody who is class of service 1, and the rest share capacity bitmasks. Here you see an example of how the bitmasks can overlap, and wherever they overlap, the cache capacity is competitively shared. So that's also a new feature that we support right now.

Now, the question is: a class of service is something you need to assign to cache-sharing entities.
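Here is a sketch of the MSR programming behind the CAT example above, using the architectural registers (IA32_L3_QOS_MASK_n at 0xC90 + n for the capacity bitmasks, IA32_PQR_ASSOC at 0xC8F for the active class of service). The wrmsr helper and the validation policy are illustrative; real code would also respect the limits enumerated via CPUID.

```cpp
#include <cstdint>

constexpr uint32_t IA32_PQR_ASSOC = 0xC8F;   // active CLOS in bits 63:32
constexpr uint32_t IA32_L3_MASK_0 = 0xC90;   // one mask MSR per class of service

void wrmsr (uint32_t msr, uint64_t val);     // ring-0 helper, declaration only

// A capacity bitmask must be non-empty, within the cache's way count,
// and contiguous.
constexpr bool valid_cbm (uint64_t cbm, unsigned ways)
{
    bool in_range   = cbm && !(cbm >> ways);
    bool contiguous = (((cbm | (cbm - 1)) + 1) & cbm) == 0;
    return in_range && contiguous;
}

// Give CLOS 0 ways 0..7 (40% of a 20-way cache) and CLOS 1 ways 8..11 (20%),
// fully isolated from each other.
void configure_cat()
{
    constexpr unsigned ways  = 20;
    constexpr uint64_t clos0 = 0x000FF;
    constexpr uint64_t clos1 = 0x00F00;
    static_assert (valid_cbm (clos0, ways) && valid_cbm (clos1, ways));
    static_assert ((clos0 & clos1) == 0);    // no overlap, zero interference

    wrmsr (IA32_L3_MASK_0 + 0, clos0);
    wrmsr (IA32_L3_MASK_0 + 1, clos1);
}

// Activate a class of service on the current core; this is the expensive MSR
// write the talk mentions, which NOVA ties to the scheduling context.
void set_active_clos (uint32_t clos)
{
    wrmsr (IA32_PQR_ASSOC, uint64_t (clos) << 32);   // RMID (bits 9:0) left at 0
}
```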
So to what type of object do you assign that class of service? You could assign it to a protection domain: you could say every box on the architecture slide gets assigned a certain class of service. But then the question is what you assign to a server that has multiple clients. That's really unfortunate, and it also means that if you have a protection domain that spans multiple cores and you say I want this protection domain to use 40% of the cache, you have to program the class-of-service settings on all cores the same way. So it's really a loss of flexibility. So that wasn't our favorite choice, and we said maybe we should assign the class of service to execution contexts instead. Again, the question is what class of service you assign to a server execution context that does work on behalf of clients, and the actual killer argument was that you would need to set the class of service in this model-specific register during each context switch, which is really bad for performance. So even option two is not what we went for.

Instead, we made the class of service a property of the scheduling context, and that has very nice properties. We only need to context switch it during scheduling decisions, so the cost of reprogramming that MSR is really not relevant anymore, and it extends the already existing model of time and priority donation with class-of-service donation. So a server does not need to have a class of service assigned to it at all; it uses the class of service of its client. If, let's say, your server implements some file system, then the amount of cache it can use depends on whether its client can use a lot of cache or not. So it's a nice extension of an existing feature, and the additional benefit is that the classes of service can be programmed differently per core. So 8 cores times 6 classes of service gives you 48 classes of service in total, instead of 6.

So that was a feature for availability. We also added some features for integrity, and if you look at the history, there's a long history of features being added to paging that improve the integrity of code against injection attacks. It all started out many years ago with the 64-bit architectures, where you could mark pages non-executable and you could basically enforce that pages are either writable or executable, but never both, so there's no confusion between data and code.
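Here is a small sketch of tying the class of service to the scheduling context, as described above: the expensive IA32_PQR_ASSOC write happens only when the scheduler switches to a scheduling context with a different class of service, not on every IPC or thread switch, and a server running on a donated scheduling context automatically inherits its client's class of service. The types and names are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t IA32_PQR_ASSOC = 0xC8F;
void wrmsr (uint32_t msr, uint64_t val);     // ring-0 helper, declaration only

struct Sc {                  // scheduling context
    unsigned prio;           // priority
    unsigned budget;         // time slice
    uint32_t clos;           // class of service, donated along with the SC
};

// Currently active CLOS; per core in the real thing (e.g. indexed by CPU id).
static uint32_t current_clos = 0;

void switch_to (Sc const &next)
{
    // No reprogramming unless the class of service actually changes; a server
    // running on a client's donated SC simply keeps running with that CLOS.
    if (next.clos != current_clos) {
        wrmsr (IA32_PQR_ASSOC, uint64_t (next.clos) << 32);
        current_clos = next.clos;
    }
    // ... dispatch the EC bound to this scheduling context ...
}
```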
[47:33.720 --> 47:38.680] And then over the years, more features were added, like supervisor mode execution prevention, [47:38.680 --> 47:45.400] where if you use that feature, kernel code can never jump into a user page and be confused [47:45.400 --> 47:47.960] into executing some user code. [47:47.960 --> 47:51.800] And then there's another feature called supervisor mode access prevention, which even says kernel [47:51.800 --> 47:57.680] code can never, without explicitly declaring that it wants to do that, read some user data [47:57.680 --> 47:59.680] page. [47:59.680 --> 48:04.560] So all of these tighten the security, and naturally Nova supports them. [48:04.560 --> 48:10.000] There's a new one called mode-based execution control, which is only relevant for guest page [48:10.000 --> 48:14.160] tables, or stage two, which gives you two separate execution bits. [48:14.160 --> 48:15.840] So there's not a single X bit. [48:15.840 --> 48:21.040] There's now executable for user and executable for supervisor. [48:21.040 --> 48:25.760] And that is a feature that ultra security can, for example, use where we can say even [48:25.760 --> 48:30.680] if the guest screws up its page tables, its stage-one page tables, the stage-two page [48:30.680 --> 48:39.000] tables can still say Linux user applications or Linux kernel code can never execute Linux [48:39.000 --> 48:44.240] user application code if it's not marked as XS in the stage-two page table. [48:44.240 --> 48:48.680] So it's again a feature that can tighten the security of guest operating systems from the [48:48.680 --> 48:49.960] host. [48:49.960 --> 48:55.160] But even if you have all that, there are still opportunities for attacks, and these [48:55.160 --> 48:59.960] classes of attacks basically reuse existing code snippets and chain them together in interesting [48:59.960 --> 49:04.800] ways using control flow hijacking, like ROP attacks. [49:04.800 --> 49:10.360] And I'm not sure who's familiar with ROP attacks; it's basically that you create a call stack [49:10.360 --> 49:15.120] with lots of return addresses that chain together simple code snippets like add this register, [49:15.120 --> 49:20.080] return, multiply this register, return, jump to this function, return. [49:20.080 --> 49:25.320] And by chaining them all together, you can build programs out of existing code snippets [49:25.320 --> 49:26.320] that do what the attacker wants. [49:26.320 --> 49:27.680] You don't have to inject any code. [49:27.680 --> 49:32.800] You simply find snippets in existing code that do what you want. [49:32.800 --> 49:34.960] And this doesn't work so well on ARM. [49:34.960 --> 49:39.840] It still works on ARM, but on ARM, the instruction length is fixed to four bytes. [49:39.840 --> 49:42.440] So you can't jump into the middle of instructions. [49:42.440 --> 49:49.200] But on x86, with its variable instruction length, you can even jump into the middle of instructions [49:49.200 --> 49:53.360] and completely reinterpret what existing code looks like. [49:53.360 --> 49:55.320] And that's quite unfortunate. [49:55.320 --> 50:01.920] So there's a feature that tightens the security around that, and it's called control flow enforcement [50:01.920 --> 50:04.640] technology, or CET.
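Before going further into CET, a minimal sketch of the mode-based execution control bits just mentioned. The bit positions follow the Intel SDM for extended page table (stage-two) entries with MBEC enabled: bit 2 means executable in supervisor mode, bit 10 means executable in user mode. The helper below is purely illustrative, not NOVA's page table code.

#include <cstdint>

enum : uint64_t {
    EPT_R  = 1ull << 0,      // read
    EPT_W  = 1ull << 1,      // write
    EPT_XS = 1ull << 2,      // execute, supervisor mode (when MBEC is enabled)
    EPT_XU = 1ull << 10,     // execute, user mode       (when MBEC is enabled)
};

// Example policy: a guest user-space code page may be executed by guest user
// mode but never by the guest kernel, regardless of the guest's stage-one tables.
constexpr uint64_t guest_user_text(uint64_t host_phys)
{
    return (host_phys & ~0xfffull) | EPT_R | EPT_XU;    // no EPT_W, no EPT_XS
}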
[50:04.640 --> 50:11.080] And that feature adds integrity to the control flow graph, both to the forward edge and to [50:11.080 --> 50:16.760] the backward edge, and forward edge basically means you protect jumps or calls that jump [50:16.760 --> 50:19.600] from one location forward to somewhere else. [50:19.600 --> 50:25.320] And the way that this works is that the legitimate jump destination where you want the jump to [50:25.320 --> 50:30.480] land, this landing pad, must have a specific end branch instruction placed there. [50:30.480 --> 50:34.920] And if you try to jump to a place which doesn't have an end branch landing pad, then you get [50:34.920 --> 50:37.640] a control flow violation exception. [50:37.640 --> 50:42.640] So you need the help of the compiler to put that landing pad at the beginning of every [50:42.640 --> 50:48.480] legitimate function, and luckily GCC and other compilers have had that support for quite [50:48.480 --> 50:49.480] a while. [50:49.480 --> 50:52.480] So GCC has had it since version 8, and we are now at 12. [50:52.480 --> 50:54.640] So that works for forward edges. [50:54.640 --> 50:58.080] For backward edges, there's another feature called shadow stack. [50:58.080 --> 51:04.120] And that protects the return addresses on your stack, and we'll have an example later. [51:04.120 --> 51:09.600] And it basically has a shadow call stack which you can't write to. [51:09.600 --> 51:16.880] It's protected by paging, and if it's writable, then it won't be usable as a shadow stack. [51:16.880 --> 51:23.600] And you can independently compile Nova with branch protection, with return address protection, [51:23.600 --> 51:25.760] or both. [51:25.760 --> 51:30.320] So let's look at indirect branch tracking, and I tried to come up with a good example, and [51:30.320 --> 51:35.440] I actually found a function in Nova which is suitable for explaining how this works. [51:35.440 --> 51:42.280] Nova has a buddy allocator that can allocate contiguous chunks of memory, and that buddy [51:42.280 --> 51:47.920] allocator has a free function where you basically hand it an address and say free this block. [51:47.920 --> 51:53.120] And the function is really as simple as shown there; it just consists of these few instructions [51:53.120 --> 51:57.880] because it's a tail call that jumps to some coalescing function here later, and you don't [51:57.880 --> 52:03.440] have to understand all the complicated assembler, but suffice it to say that there's a little [52:03.440 --> 52:09.160] test here of these two instructions which performs some meaningful check, and you know [52:09.160 --> 52:11.520] that you can't free a null pointer. [52:11.520 --> 52:15.960] So this test checks if the address passed as the first parameter is a null pointer, and [52:15.960 --> 52:18.440] if so it jumps out right here. [52:18.440 --> 52:22.960] So basically the function does nothing, does no harm, it's basically a no-op. [52:22.960 --> 52:27.720] Let's say an attacker actually wanted to compromise memory, and instead of jumping to the beginning [52:27.720 --> 52:32.640] of this function, they wanted to jump past that check to this red instruction to bypass the [52:32.640 --> 52:34.920] check and then corrupt memory.
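A rough source-level sketch of the function just described, simplified and hypothetical rather than NOVA's actual buddy allocator, with comments marking where the end branch landing pad goes and what the bypass target would be.

struct Buddy {
    static void free (void *ptr)
    {
        // With -fcf-protection=branch (GCC >= 8), the compiler emits an endbr64
        // landing pad at this function entry -- the only legal target for an
        // indirect call or jump into this function.
        if (!ptr)
            return;                 // freeing a null pointer is a harmless no-op

        coalesce (ptr);             // an attacker-controlled indirect branch aimed
                                    // directly at this point (to bypass the check)
                                    // hits no landing pad and raises a
                                    // control-protection fault instead
    }

private:
    static void coalesce (void *)
    {
        // ... merge the block with its buddy (omitted in this sketch)
    }
};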
[52:34.920 --> 52:39.200] Without control flow enforcement that would be possible if the attacker could gain execution, [52:39.200 --> 52:45.200] but with control flow enforcement it wouldn't work, because when you do an indirect call or jump you have to land [52:45.200 --> 52:48.920] on an end branch instruction, and the compiler has put that instruction there. [52:48.920 --> 52:53.840] So if an attacker managed to get control and tried to jump through a vtable or some indirect [52:53.840 --> 52:59.360] pointer to this address, you would immediately crash. [52:59.360 --> 53:03.560] So this is how indirect branch tracking works. [53:03.560 --> 53:06.720] Shadow stacks work like this. [53:06.720 --> 53:10.480] With a normal data stack you have your local variables on your stack, you have the parameters [53:10.480 --> 53:14.240] for the next function on the stack, so the green function wants to call the blue function, [53:14.240 --> 53:18.000] and then when you do the call instruction the return address gets put on your stack. [53:18.000 --> 53:22.400] Then the blue function puts its local variables on the stack, wants to call the yellow function, [53:22.400 --> 53:25.800] puts the parameters for the yellow function on the stack, calls the yellow function, so [53:25.800 --> 53:29.400] the return address for the blue function gets put on the stack. [53:29.400 --> 53:33.840] And the stack grows downward, and you see that the return address always lives [53:33.840 --> 53:35.320] above the local variables. [53:35.320 --> 53:40.600] So if your local variables, if you allocate an array on the stack and you don't have proper [53:40.600 --> 53:45.120] bounds checking, it's possible to overwrite the return address by writing past the array, [53:45.120 --> 53:50.800] and this is a popular attack technique, the buffer overflow exploits that you find in the wild. [53:50.800 --> 53:59.160] So if you have code that is potentially susceptible to these kinds of return address overwrites, [53:59.160 --> 54:01.320] then you could benefit from shadow stacks. [54:01.320 --> 54:07.240] And the way that this works is there's a separate stack, this shadow stack, which is protected [54:07.240 --> 54:11.360] by paging so you can't write to it with any ordinary memory instructions, it's basically [54:11.360 --> 54:16.840] invisible, and the only instructions that can write to it are call and return instructions [54:16.840 --> 54:19.200] and some shadow stack management instructions. [54:19.200 --> 54:22.840] And when the green function calls the blue function, the return address will not just [54:22.840 --> 54:27.600] be put on the ordinary data stack, but will additionally be put on the shadow stack, and [54:27.600 --> 54:29.960] likewise with the blue and the yellow return address. [54:29.960 --> 54:34.560] And whenever you execute a return instruction, the hardware will compare the two return addresses [54:34.560 --> 54:38.880] that it pops off the two stacks, and if they don't match, you again get a control flow [54:38.880 --> 54:40.880] violation. [54:40.880 --> 54:46.680] So that way, you can protect the backward edge of the control flow graph also, using [54:46.680 --> 54:47.680] shadow stacks. [54:47.680 --> 54:52.160] It's a feature that NOVA uses on Tiger Lake and Alder Lake and platforms beyond [54:52.160 --> 54:55.960] that that have this feature. [54:55.960 --> 54:56.960] But there's a problem.
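Before getting to that problem, a minimal sketch of the kind of unsafe code that shadow stacks defend against; the function and the overflow are deliberately contrived for illustration and are not from NOVA.

#include <cstring>

// Deliberately unsafe, for illustration only.
void vulnerable (char const *input, unsigned long len)
{
    char buf[16];                    // local array, lives below the return address
    std::memcpy (buf, input, len);   // no bounds check: len > 16 overwrites the
                                     // saved return address on the data stack
}                                    // with shadow stacks enabled, 'ret' compares the
                                     // data-stack copy with the shadow-stack copy and
                                     // raises a control-protection fault on mismatch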
[54:56.960 --> 55:05.880] And the problem is that using shadow stack instructions is possible on newer CPUs that [55:05.880 --> 55:10.280] have these instructions, that basically have this ISA extension, but if you have a binary [55:10.280 --> 55:16.240] containing those instructions, it would crash on older CPUs that don't comprehend them. [55:16.240 --> 55:20.840] And luckily, Intel defined the end branch instruction to be a NOP, but some shadow stack [55:20.840 --> 55:22.680] instructions are not NOPs. [55:22.680 --> 55:31.920] So if you try to execute a CET-enabled NOVA binary on something older, without further effort [55:31.920 --> 55:32.920] it might crash. [55:32.920 --> 55:35.020] So obviously, we don't want that. [55:35.020 --> 55:44.040] So what NOVA does instead is it detects at runtime whether CET is supported, and if CET is not [55:44.040 --> 55:52.000] supported, it patches out all these CET instructions in the existing binary to turn them into NOPs. [55:52.000 --> 55:55.960] And obviously, being a microkernel, we try to generalize the mechanism. [55:55.960 --> 56:00.280] So we generalized that mechanism to be able to rewrite arbitrary assembler snippets from [56:00.280 --> 56:02.480] one version to another version. [56:02.480 --> 56:06.480] And there are other examples of newer instructions that do a better job than older instructions, [56:06.480 --> 56:11.760] like the XSAVE feature set, which can save supervisor state or save floating point state [56:11.760 --> 56:13.960] in a compact format. [56:13.960 --> 56:20.480] And the binary, as you build it originally, always uses the most sophisticated version. [56:20.480 --> 56:23.560] So it uses the most advanced instruction that you can find. [56:23.560 --> 56:28.400] And if we run that on some CPU which doesn't support the instruction, or which supports [56:28.400 --> 56:33.440] some older instruction, then we use code patching to rewrite the newer instruction into the [56:33.440 --> 56:34.440] older one. [56:34.440 --> 56:40.240] So the binary automatically adjusts to the feature set of the underlying hardware. [56:40.240 --> 56:45.520] The newer your CPU, the less patching occurs, but it works quite well. [56:45.520 --> 56:49.480] And the reason we chose this approach is because the alternatives aren't actually great. [56:49.480 --> 56:53.800] So the alternatives would have been that you put some ifdefs in your code and you say, [56:53.800 --> 56:57.240] if you have CET, use the CET instructions, and otherwise don't. [56:57.240 --> 57:01.360] And then you force your customers or your community to always compile the binary the [57:01.360 --> 57:04.080] right way, and that doesn't scale. [57:04.080 --> 57:09.280] The other option could have been that you put in some if-then-else, you say, if CET is supported, [57:09.280 --> 57:11.400] do this, otherwise do that. [57:11.400 --> 57:14.040] And that would be a runtime check every time. [57:14.040 --> 57:19.200] And that runtime check is prohibitive in certain code paths, like entry paths, where you simply [57:19.200 --> 57:24.040] don't have any registers free for doing this check because you have to save them all. [57:24.040 --> 57:29.080] But in order to save them, you already need to know whether shadow stacks are supported [57:29.080 --> 57:30.080] or not. [57:30.080 --> 57:36.240] So doing this feature check at boot time and rewriting the binary to the suitable instruction [57:36.240 --> 57:39.540] is what we do, and that works great.
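A minimal sketch of that boot-time patching idea, using a hypothetical Patch_site table and an assumed has_feature() CPUID wrapper; this is not NOVA's actual implementation, just the general shape of the mechanism.

#include <cstdint>
#include <cstring>

enum class Feature { CET_SS, XSAVES /* ... */ };

bool has_feature(Feature);                     // assumed CPUID-based check

struct Patch_site {
    void          *addr;                       // location of the preferred code
    uint8_t        len;                        // length of the region to rewrite
    Feature        need;                       // feature the preferred code requires
    uint8_t const *fallback;                   // replacement bytes (NOPs or older insn)
};

extern Patch_site patch_sites[];               // table emitted alongside the code
extern unsigned   patch_count;

void apply_patches()                           // runs once during boot
{
    for (unsigned i = 0; i < patch_count; i++) {
        Patch_site const &p = patch_sites[i];
        if (!has_feature(p.need))
            std::memcpy(p.addr, p.fallback, p.len);   // text is still writable at boot
    }
}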
[57:39.540 --> 57:46.080] So the way it works is you declare some assembler snippets, like XSAVES is the preferred version. [57:46.080 --> 57:52.240] If XSAVES is not supported, the snippet gets rewritten to XSAVE, or a shadow stack instruction [57:52.240 --> 57:56.560] gets rewritten to a NOP. [57:56.560 --> 58:01.440] We don't need to patch any high-level C++ functions because they never compile to those [58:01.440 --> 58:03.840] complicated instructions. [58:03.840 --> 58:09.740] And yeah, we basically have a binary that automatically adjusts. [58:09.740 --> 58:16.320] So finally, let's take a look at performance, because IPC performance is still a relevant [58:16.320 --> 58:21.080] metric if you want to be not just small but also fast. [58:21.080 --> 58:27.520] And the blue bars here in the slide show Nova's baseline performance on modern Intel [58:27.520 --> 58:32.360] platforms like NUC12 with Alder Lake and NUC11 with Tiger Lake. [58:32.360 --> 58:37.240] And you can see that if you do an IPC between two threads in the same address space, it's [58:37.240 --> 58:41.360] really in the low nanosecond range, like 200-and-some cycles. [58:41.360 --> 58:45.880] If you cross address spaces, you have to switch page tables, you maybe have to switch the class [58:45.880 --> 58:51.760] of service, then it takes 536 cycles. [58:51.760 --> 58:55.520] And it's comparable on other micro-architectures, but the interesting thing that I want to [58:55.520 --> 59:01.280] show with this slide is that there's overhead for control flow protection. [59:01.280 --> 59:10.200] So if you just enable indirect branch tracking, the performance overhead is some 13% to 15%. [59:10.200 --> 59:15.520] If you enable shadow stacks, the performance overhead is increased some more. [59:15.520 --> 59:21.040] And if you enable the full control flow protection, the performance overhead in the relevant [59:21.040 --> 59:25.000] case, which is the cross-address-space case, is up to 30%. [59:25.000 --> 59:29.640] So users can freely choose through these compile time options what level of control [59:29.640 --> 59:35.200] flow protection they are willing to trade for what decrease in performance. [59:35.200 --> 59:41.040] So the numbers are basically just ballpark figures to give people a feeling for: if I use [59:41.040 --> 59:44.880] this feature, how much IPC performance do I lose? [59:44.880 --> 59:47.040] So with that, I'm at the end of my talk. [59:47.040 --> 59:51.640] There are some links here where you can download releases, where you can find more information. [59:51.640 --> 59:54.000] And now I'll open it up for questions. [59:54.000 --> 59:57.000] Thank you so much, Udo. [59:57.000 --> 01:00:01.200] So we have time for some questions. [01:00:01.200 --> 01:00:02.200] Yeah. [01:00:02.200 --> 01:00:03.960] And then you're partying. [01:00:03.960 --> 01:00:04.960] Thank you. [01:00:04.960 --> 01:00:14.880] It was a really, really nice talk; it's nice to see how many new things are in Nova. [01:00:14.880 --> 01:00:21.400] One thing I would like to ask is: you mentioned that the page table code is formally verified [01:00:21.400 --> 01:00:23.920] and that it's also lock free. [01:00:23.920 --> 01:00:30.360] What tools did you use for formal verification, especially in regard to the memory model for [01:00:30.360 --> 01:00:31.360] verification? [01:00:31.360 --> 01:00:32.360] Thank you.
[01:00:32.360 --> 01:00:36.560] So I must say that I'm not a formal verification expert, but I obviously have regular meetings [01:00:36.560 --> 01:00:38.520] and discussions with all the people. [01:00:38.520 --> 01:00:44.800] And the tool that we are using is the Coq theorem prover for basically doing the proofs. [01:00:44.800 --> 01:00:51.040] But for concurrent verification, there's a tool called Iris that implements separation [01:00:51.040 --> 01:00:52.040] logic. [01:00:52.040 --> 01:01:00.960] Well, the memory model that we verify depends on whether you're talking about x86 or ARM. [01:01:00.960 --> 01:01:05.920] For ARM, we're using the multi-copy atomic memory model. [01:01:05.920 --> 01:01:10.680] Also, thanks for the talk. [01:01:10.680 --> 01:01:13.920] And it's great to see such nice progress. [01:01:13.920 --> 01:01:14.920] Just a quick question. [01:01:14.920 --> 01:01:19.160] In the beginning of the talk, you said that you have this command line option to clamp [01:01:19.160 --> 01:01:24.440] the CPU frequency to disable the turbo boosting. [01:01:24.440 --> 01:01:26.200] Why can't you do that at runtime? [01:01:26.200 --> 01:01:28.440] Why can't you configure it at runtime? [01:01:28.440 --> 01:01:33.560] We could configure it at runtime too, but we haven't added an API yet because the code [01:01:33.560 --> 01:01:36.680] that would have to do that simply doesn't exist yet. [01:01:36.680 --> 01:01:44.600] But there's no technical reason for why userland couldn't control the CPU frequency at arbitrary [01:01:44.600 --> 01:01:45.600] points in time. [01:01:45.600 --> 01:01:46.600] Okay, wonderful. [01:01:46.600 --> 01:01:47.600] Thanks. [01:01:47.600 --> 01:01:55.600] I was going to ask you about the verification aspect of this. [01:01:55.600 --> 01:01:56.600] Okay, got you. [01:01:56.600 --> 01:01:57.600] Any other questions? [01:01:57.600 --> 01:01:58.600] Yeah. [01:01:58.600 --> 01:02:02.600] Can you just say, sorry, Jonathan, it's going to be a lot too. [01:02:02.600 --> 01:02:08.600] Yeah, just to clarify, on the point of the DMA attack, were you talking about protecting [01:02:08.600 --> 01:02:13.360] the guests or the host from the DMA attack? [01:02:13.360 --> 01:02:19.080] So the question was about the DMA attack that I showed in this slide here, and you'll find [01:02:19.080 --> 01:02:22.640] the slides online after the talk. [01:02:22.640 --> 01:02:26.640] This is not a DMA attack of guest versus host, this is a boot-time DMA attack. [01:02:26.640 --> 01:02:31.560] So you can really think of this as a timeline: firmware starts, boot loader starts, [01:02:31.560 --> 01:02:32.560] Nova starts. [01:02:32.560 --> 01:02:39.160] And at the time that Nova turns on the IOMMU, both guests and host will be DMA protected. [01:02:39.160 --> 01:02:44.600] But Nova itself could be susceptible to a DMA attack if we didn't disable bus mastering, simply [01:02:44.600 --> 01:02:50.840] because the firmware does these legacy backward-compatibility shenanigans that we don't like. [01:02:50.840 --> 01:02:55.960] And I bet a lot of other microkernels are susceptible to problems like this too, and the fix would [01:02:55.960 --> 01:02:57.960] work for them as well. [01:02:57.960 --> 01:03:00.520] Thanks, Udo, for the talk. [01:03:00.520 --> 01:03:07.760] I would like to know, can you approximate what percentage of the architecture-specific [01:03:07.760 --> 01:03:18.800] code was added because of these security measures?
[01:03:18.800 --> 01:03:26.480] So most of the security measures that I talked about are x86 specific, and ARM has similar [01:03:26.480 --> 01:03:31.200] features, like they have a guarded control stack specified in ARMv9, but I don't think [01:03:31.200 --> 01:03:33.040] you can buy any hardware yet. [01:03:33.040 --> 01:03:40.160] You can take the difference between x86 and AArch64 as a rough ballpark figure, but it's [01:03:40.160 --> 01:03:45.520] really not all that much. For example, the multi-key total memory encryption: [01:03:45.520 --> 01:03:51.040] that's just a few lines of code added to the x86-specific page table class because it [01:03:51.040 --> 01:03:54.880] was already built into the generic class to begin with. [01:03:54.880 --> 01:04:04.840] Control flow enforcement is probably 400 lines of assembler code in the entry paths and the switching. [01:04:04.840 --> 01:04:09.400] I did a quick test as to how many end branch instructions a compiler would actually inject [01:04:09.400 --> 01:04:10.400] into the code. [01:04:10.400 --> 01:04:15.000] It's like 500 or so, because you get one for every interrupt entry and then one for every [01:04:15.000 --> 01:04:19.120] function, and it also inflates the size of the binary a bit, but not much. [01:04:19.120 --> 01:04:22.880] And the performance decrease for indirect branch tracking, among other things, comes [01:04:22.880 --> 01:04:27.040] from the fact that the code gets inflated and it's not as dense anymore. [01:04:27.040 --> 01:04:28.040] Okay. [01:04:28.040 --> 01:04:32.040] Yeah, final question, please, because red is one of the, yeah. [01:04:32.040 --> 01:04:39.040] You were saying that you were able to achieve an ELF binary without relocations. [01:04:39.040 --> 01:04:40.040] Yeah. [01:04:40.040 --> 01:04:46.600] Can you elaborate a little bit on how you did that, which linker did you use? [01:04:46.600 --> 01:04:54.240] So it's the normal GNU ld, but you could also use gold or mold or any of the normal linkers. [01:04:54.240 --> 01:05:00.920] So the reason why no relocation is needed is that for the paged code, as long as you put the [01:05:00.920 --> 01:05:05.640] right physical address in your page table, the virtual address is always the same. [01:05:05.640 --> 01:05:10.040] So virtual memory is some form of relocation, where you say no matter where I run in physical [01:05:10.040 --> 01:05:12.360] memory, the virtual address is always the same. [01:05:12.360 --> 01:05:18.800] For the unpaged code, which doesn't know at which physical address it was actually launched, [01:05:18.800 --> 01:05:22.600] you have to use position-independent code, which basically says I don't care at which physical [01:05:22.600 --> 01:05:29.440] address I run, I can run at an arbitrary address because all my data structures are addressed [01:05:29.440 --> 01:05:31.240] relatively, or something like that. [01:05:31.240 --> 01:05:34.720] And at some point you need to know what the offset is between where you wanted it to run [01:05:34.720 --> 01:05:36.960] and where you actually run, but that's simple. [01:05:36.960 --> 01:05:40.560] It's like you call your next instruction, you pop the return address off the stack, you [01:05:40.560 --> 01:05:43.720] compute the difference, and then you know. [01:05:43.720 --> 01:05:44.720] Thank you so much, Udo. [01:05:44.720 --> 01:05:45.720] Thank you. [01:05:45.720 --> 01:06:14.720] So the slides are online, the recording as well.
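A minimal sketch of that call-next-instruction idiom from the last answer, written as x86-64 GNU inline assembly in C++. It is illustrative only, not NOVA's startup code, and it assumes a freestanding build without a red zone so the transient push made by the call is harmless.

#include <cstdint>

uint64_t load_offset()
{
    uint64_t runtime, linktime;
    asm volatile ("call 1f\n\t"
                  "1: pop %0\n\t"          // runtime address of label 1 (where we actually run)
                  "movabsq $1b, %1"        // link-time address of label 1 (where we were linked to run)
                  : "=r" (runtime), "=r" (linktime));
    return runtime - linktime;             // offset to apply to absolute addresses
}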