[00:00.000 --> 00:12.960] Hello, everyone. I'm Zahra. I work for Microsoft as a part of Linux Systems Group, LSU. This [00:12.960 --> 00:19.760] type is going to be about an ongoing project that we have on hardening Linux kernel with [00:19.760 --> 00:26.280] architecture capabilities. To be clear, this is not a product by Microsoft, so it's more [00:26.280 --> 00:33.400] like an exploratory project that we want to see if this hardware feature can be used [00:33.400 --> 00:40.080] for security issues that we want to fix on Linux kernel, and also as a part of this process, [00:40.080 --> 00:47.680] if we can find some new vulnerabilities, attack vectors, and things like that. [00:47.680 --> 00:53.320] So I'm going to start with a very brief background, like an intro on Cherry and a state of the [00:53.320 --> 01:03.360] world Linux, and then some of my work that capability-based hardening and future work [01:03.360 --> 01:08.320] and opportunities that I'm really hopeful that open source community can help us with that. [01:08.320 --> 01:16.760] So the big picture problem is that, as you all know, operating systems are really complex. [01:17.720 --> 01:26.760] We have millions of lines of code with a lot of complex abstractions, and basically any forms [01:26.760 --> 01:32.560] of proper hardening for a kernel is still an open problem. We have all kinds of software-based, [01:32.560 --> 01:38.160] like control flow integrity, like approaches, like compiler-based techniques for fuzzing, [01:38.160 --> 01:45.360] and all of these approaches are helpful, but we still see lots of vulnerabilities. Some of these [01:45.360 --> 01:51.800] are memory safety, like many of them are memory safety, but some are also logical problems because [01:51.800 --> 01:58.560] of this complex monolithic structure of the kernel. At the same time, Linux kernel also has [01:58.560 --> 02:08.480] different security subsystems. We have a combination of, for example, LSMs, DAC, sandboxing [02:08.480 --> 02:17.360] techniques for, for example, SECCOM, EVPF, and still also in this complex stack, the proper [02:17.360 --> 02:22.400] integration and hardening of these security subsystems themselves is not also a clear, [02:22.400 --> 02:30.840] we don't have a clear solution for that. At the same time, we also have a lot of ongoing hardware [02:30.880 --> 02:39.080] security features that are going, like, are adding to or suffer our hardware platforms. Like, for [02:39.080 --> 02:43.920] example, just on ARM, we have a combination of, like, core-screen privilege separation techniques, [02:43.920 --> 02:50.280] like trust zone T's. We have, like, more, like, at the same time, like, finer-grained memory [02:50.280 --> 02:57.560] safety features, like pointer authentication, memory tag extensions, and as we go, like, [02:57.960 --> 03:03.880] for example, modern hardware, we can have, like, for example, resource domain controllers, and all [03:03.880 --> 03:09.960] of these, like, hardware features are not really, so they are there, but our operating system doesn't [03:09.960 --> 03:17.880] really use them, like, in a fundamental way, like, in a basically principal approach. So this is the [03:17.880 --> 03:25.920] big picture of, like, lots of problems that we have for both hardening the kernel and also, [03:25.960 --> 03:33.560] like, using these hardware security features properly. And Cherry is one of, like, this fine-grained [03:33.560 --> 03:40.760] both for memory safety and for extensible compartmentalization features. That it has, like, [03:40.760 --> 03:46.480] a really old history from the University of Cambridge. I think, like, about 14, 15 years of, [03:46.480 --> 03:53.800] like, research is behind it. And the concept of, like, capability-based security models, [03:53.960 --> 04:00.920] it's, that concept is not new. We have it, like, even on, like, file descriptor, like, abstraction [04:00.920 --> 04:08.360] for Linux. So basically, having an unforeachable, like, token of authority that's, it's for, like, [04:08.360 --> 04:16.840] accessing any kind of, like, sensitive object. But the novelty of Cherry is that you have this [04:16.840 --> 04:25.720] hardware-software semantic approach for bringing this, like, concept, this, like, memory safety [04:25.720 --> 04:33.560] concept to, like, both your hardware architecture and, like, an instruction level and also, like, [04:33.560 --> 04:39.400] really have the opportunity to redesign your systems as, like, based on that. So they have, [04:39.880 --> 04:48.200] like, these extensions on MIPS, on RISC-5, and on, recently on ARM. And also, like, I think they [04:48.200 --> 04:53.480] complete, like, an example of their systems as, like, is based on previousity. So the Linux one is [04:53.480 --> 05:02.840] a new one that's mostly ARM, like, folks are working on. So what's Morello? Morello is, [05:03.720 --> 05:11.160] basically, the new development, like, experiments kind of, like, board for having, for adding Cherry [05:11.160 --> 05:20.040] to ARM V8. And it's extending, like, basically, the entire, like, instructions with, like, new [05:20.040 --> 05:26.600] registers and new sets of, like, and also extending, like, previous, like, systems registers for ARM. [05:26.600 --> 05:35.320] It's basically, like, introducing, like, the 129-bit pointers. So every Cherry pointer has, [05:35.320 --> 05:43.000] like, besides the value, it has the whole set of, like, metadata that contains its bond, [05:43.000 --> 05:49.160] boundary of, like, the memory region, the object type, and the permissions, like, basically, for [05:49.240 --> 05:57.320] any kind of, like, access to that pointer. And ARM also has this, like, added this notion of, [05:57.320 --> 06:06.280] like, controlled non-monoticity. That's basically, like, it's trying to also, because, as you know, [06:06.280 --> 06:15.640] like, ARM has, like, at least six execution levels for, like, EL0, EL1, EL2, and, like, [06:15.640 --> 06:21.560] at the same time, like, in the secure board. So somehow, this notion of, like, Moeller should [06:21.560 --> 06:27.960] be extended to all of these execution levels. So that's why they added this through, like, [06:27.960 --> 06:35.480] new exceptions sets, new executive and restrictive mode, privileged execution, and also some unsealing, [06:35.480 --> 06:42.280] like, operation that I'm going to describe later. So for every pointer, we have this [06:42.280 --> 06:48.760] permission set that, basically, by hardware, you can say, like, this, like, piece of memory should [06:48.760 --> 06:55.320] have what kind of, like, permission access. It can have, like, load, execute, just store, [06:55.320 --> 07:00.680] or even more complex, like, access controls, like, if you want to have it immutable, for example, [07:00.680 --> 07:07.400] like, region, like, through ceiling, or if you want to have even, like, systems-based, like, access [07:07.960 --> 07:13.000] controls, like, for, and software-defined, like, waste access controls, that if you want to have [07:13.000 --> 07:21.080] your own custom, basically, permissions to be defined for that, like, pointer, that capability. [07:22.280 --> 07:27.640] And the interesting thing is that, like, this system, like, accessing system registers that [07:27.640 --> 07:35.400] you can define to these capabilities, it's, the behavior is still, it's not, like, really affected [07:35.400 --> 07:42.840] by your, like, hypervisor mode, like, HVC calls, SMC calls to set security monitor, and also [07:42.840 --> 07:54.280] supervisor mode. So as you can see, besides, besides, like, the notion of, like, capabilities, [07:54.280 --> 08:01.800] you need to, like, change, basically, ARM had to change, like, several of, like, system registers, [08:01.800 --> 08:07.960] including the control registers. It, like, we have new registers, for example, for bounds, for [08:07.960 --> 08:13.720] setting, like, converting, like, pointer capabilities, like, we had to, we have to, like, have a new, [08:14.840 --> 08:21.640] like, PC, like, program counter, like, in some PCC, for example, instead of, like, PC, but at the same [08:21.640 --> 08:31.000] time, the execution levels for, like, EL0, EL1, EL2, and EL2, all of these should also be aware [08:31.000 --> 08:35.960] of the concept of capabilities. So most of, like, the control registers should also, like, [08:36.920 --> 08:42.840] they're also changed. And there are, like, you see, like, for example, CTLR, it's now, like, [08:43.480 --> 08:50.680] capability-based, like, CTLR. And this, this similar thing, for example, for, like, [08:50.680 --> 08:56.120] trade IDs, and things like that. So, for example, like, the neural Linux, [08:56.120 --> 09:02.440] like, trade structure tab, like, instead of, like, trade ID, like, traditional, it has, like, control [09:02.440 --> 09:09.080] capability-based trade IDs, or restrictive-based, like, trade IDs, that you can find most of these [09:09.080 --> 09:16.040] details in the technical manual. Similarly, as I said, like, we have a new set of exceptions [09:16.040 --> 09:20.360] for, basically, capability-based exceptions for any faults that you get from, like, [09:21.160 --> 09:26.840] permissions, like, accessing them, like, setting boundaries, like, right or wrong, so things like [09:26.840 --> 09:33.240] that. So, as I said, the whole details, it's, like, basically, a lot of details, so you can [09:34.040 --> 09:41.000] find them mostly on, like, Cherry site, and Morello, like, project, all of them, especially the arm [09:41.000 --> 09:47.720] one, it's, like, everything is, like, open, so you can, I'd really, like, if anybody in community [09:47.720 --> 09:55.800] to go and check this. So, about the state of Morello Linux, it's, the, the maintainers are [09:55.800 --> 10:02.360] most from ARM. They are really doing a very good job on, basically, in a very short time, they have [10:02.360 --> 10:10.840] a stable environment for Linux development. If you go, look at that, you see that, like, they're [10:10.840 --> 10:17.720] already, like, enabled most, most syscals. They're already, like, they have, like, distros, like, [10:18.280 --> 10:24.760] like, Debian, and they have both, like, even if you don't have, like, the development board, [10:24.760 --> 10:31.960] you can just, like, use their FVP, fixed virtualization platform, something that's basically [10:31.960 --> 10:40.280] an good emulator. And the whole system is, like, really, like, ready for experiment for both from [10:40.280 --> 10:51.320] the user space and the kernel development. Also, like, from their perspective, like, they modified [10:51.320 --> 10:57.400] most of, like, the main modifications of, like, memory management for adding capability-based [10:57.400 --> 11:03.400] architecture and things, like, that they're added. The main problems now that I'm going to discuss are [11:04.040 --> 11:10.600] from the security perspective, can be from the, like, the intersections of, like, user and security, [11:10.600 --> 11:15.720] user space and, like, the kernel space security, their interactions, their shared memory, and [11:15.720 --> 11:24.520] things like that. So, for example, in my experience, when, so, I first started with enabling some of the [11:25.480 --> 11:32.600] security features to more Linux, and the experiment was, like, really easy. I was just, [11:33.160 --> 11:42.280] trying to, like, for example, get the TE stack, like, TE driver running, adding, like, trusted keys, [11:42.280 --> 11:50.280] like, like, BVPF, like, checking, like, if the BVPF is working, like, properly on more low. [11:50.920 --> 11:56.520] And in most of the cases, when you want to add, like, these features to the more Linux, [11:56.520 --> 12:04.120] the, like, issues that I was seeing is, like, minor issues. So, basically, mostly, like, [12:04.120 --> 12:11.000] pointer mismatch, like, in the current architecture that they have, like, a pure capability-based [12:11.000 --> 12:16.840] ABI, most of the, like, so, basically, most of the issues coming from, like, when you enable these [12:16.840 --> 12:24.520] features, like, you have, like, traditional pointer abstractions that you need to convert [12:24.520 --> 12:33.160] them to capability-based abstraction. And, for example, like, when I was working on enabling [12:33.160 --> 12:39.640] the U-Axis, that, as you know, like, U-Axis is mostly from the Linux, but you have, like, [12:39.640 --> 12:45.720] all of the Linux abstraction for, like, function for communicating with the user space, like, [12:45.720 --> 12:52.120] passing pointers, passing shared memory, and things like that. And so, this required, like, [12:52.120 --> 12:59.720] low-level, like, I think, capability, like, instructions to Linux. And after we did that, [12:59.720 --> 13:05.000] like, changing the put user, like, get user, and things like that, basically, like, the kernel [13:05.000 --> 13:11.320] breaks, like, in several places. But the break was mostly on, like, okay, for example, here, [13:11.880 --> 13:19.880] on, I notify user, it says that, like, this pointer, like, it's a, like, integer user space, [13:19.880 --> 13:27.240] like, pointer. And so, it's not a capability. So, basically, we need to find out, like, dig out, [13:27.240 --> 13:33.000] like, what kind of, like, pointer is, it's, like, that, it's an address, or it's just, like, an [13:33.000 --> 13:38.200] integer pointer, and things like that. And try to use the abstract, like, tree abstraction to [13:38.200 --> 13:45.480] convert them, like, in a secure way, to capability. And then, like, it, for example, this one was, [13:45.560 --> 13:53.160] like, a kind of large patch that still needs more filling up. But, like, after fixing this, like, [13:54.120 --> 14:00.920] about 50 files, but in a very, like, a small, like, lines of code, you have, like, [14:01.800 --> 14:10.440] we have the user space, like, based, like, capability back-ends for Linux. But the tweaks are, [14:10.440 --> 14:17.000] like, actually, the good thing about this, this process is that you can find out, like, dig out, [14:17.000 --> 14:22.920] like, if there is a, like, basically, viability, like, if there were, at some, some of them, [14:22.920 --> 14:30.520] it was just not just a cherry, like, the pointer mismatch was, like, something that could, could [14:31.160 --> 14:36.040] provide, like, could be an issue, like, in the future, like, to have memory viability issues, [14:36.040 --> 14:43.080] or things like that, or if, for example, they had the boundaries right, and things that can go wrong, [14:43.080 --> 14:53.080] even, like, if we don't have cherry. So, the good news is, you have a lot of helper functions, [14:53.080 --> 15:00.440] like, from the compiler for, like, using cherry, getting, like, setting boundaries, and, like, [15:00.520 --> 15:09.080] converting capabilities to pointers and vice versa. So, that's good. The other thing that's, [15:09.720 --> 15:16.200] so, the current state of the model in Linux is that there's a main root capability that's, [15:16.200 --> 15:22.280] basically, every other capability is generated from that. So, what I'm working on is, basically, [15:22.280 --> 15:28.280] adding more, like, finer-growing, like, capabilities for both ceiling, like, making, [15:29.240 --> 15:35.240] especially, like, the sensitive, like, parts of the user space or the kernel immutable after, [15:35.240 --> 15:41.640] like, the operations are done. And, also, like, so, basically, we need to add more root capabilities [15:41.640 --> 15:49.400] for the user space, for, like, a specific capability for ceiling and making, like, both the kernel [15:49.400 --> 16:01.320] subsystems and user space subsystems immutable. And, also, we need to use better the concept of, like, [16:01.320 --> 16:09.960] this software-defined permissions on cherry. So, one of the, for example, custom, like, [16:10.920 --> 16:19.960] permissions that are added on free BSD is the permission on syscalls and permission on software, [16:19.960 --> 16:27.400] for example, virtual memory. So, this will, kind of, like, let them to define, like, and [16:27.400 --> 16:34.680] sandboxing the, like, environment, sandboxing abstractions that's, basically, backed by cherry. [16:35.320 --> 16:40.280] That's, like, this, this is, like, a really useful thing that, for example, can be, like, [16:40.280 --> 16:49.320] useful for EVPF or sitcom-based syscall filtering. So, the other thing that, like, we are working on [16:49.320 --> 16:55.000] is that, like, what's, basically, the better combinations of, like, this software-defined [16:55.000 --> 17:00.840] permissions for Linux and sandboxing. That's, basically, can, can get a lot of, like, feedback, [17:00.920 --> 17:07.800] it's usually useful to get feedback from EVPF guys, like, spokes and, like, sandboxing people [17:07.800 --> 17:13.640] that are working on sandboxing. So, see if, like, we can add these kind of, like, abstractions [17:13.640 --> 17:28.120] and integrate them properly to Linux security subsystems. So, as I said, like, most of the, [17:28.120 --> 17:34.680] like, the goal of this project at the end is, like, we want to use this hardware feature, [17:34.680 --> 17:41.320] like, similar hardware features, for protecting Linux security subsystems. [17:41.320 --> 17:49.800] LSMs, EVPF, and name-assets. So, basically, this is a very, like, open, like, area that [17:49.800 --> 17:55.480] it can be, like, really benefited from community and open-source community to be involved. [17:56.200 --> 18:03.560] And, besides that, one of the things that it's, like, it's, it's really, like, an [18:03.560 --> 18:08.920] earliest stage for, for it is that the whole systems are stacked from the hypervisor, from [18:08.920 --> 18:14.840] secure kernels and trusted execution environments. Now, for the first time, we have this option [18:14.840 --> 18:21.880] that we can, like, integrate fine-grained, like, memory protections and scalable, like, [18:21.960 --> 18:28.360] compartmentalization features to, for example, or trusted execution environment. And there's, [18:28.360 --> 18:34.600] like, a huge area of, like, attack vectors between the interactions of, like, this secure world and, [18:34.600 --> 18:40.040] like, the normal world and, like, the kernel environment and all of these, like, systems [18:40.040 --> 18:47.400] are stacked now have, like, a way to, like, to protect us from the secure RPC passing, like, [18:48.040 --> 18:54.120] pointers. That's basically the main attack vectors for all of, like, T, T environments. [18:54.840 --> 19:01.880] Now, we have the options to redesign these stacks based on these fine-grained security features. [19:03.880 --> 19:13.880] So, to summarize, the state of Linux, the Moerla Linux is really ready for, like, [19:13.960 --> 19:20.280] a special open-source community to get involved. And there are a lot of, like, basically open [19:20.280 --> 19:28.600] problems that we're looking forward to, like, get feedback from, especially, like, Linux community [19:28.600 --> 19:35.080] that are working on security subsystem to both, like, hardening the kernel itself, hardening the [19:35.080 --> 19:41.960] kernel security subsystems and, at the same time, adding more compartmentalization tools, [19:42.680 --> 19:49.480] sandboxing tools based on these kind of fine-grained features. Because at the end, [19:49.480 --> 19:55.800] the Linux, like, privilege separation is really coarse-grained. And most of the problems are, [19:55.800 --> 20:01.640] like, from, like, these huge, more intricate stacks can be solved if we have a more, like, [20:01.640 --> 20:07.800] better abstraction for privilege separation and compartmentalization. Also, from people, [20:07.800 --> 20:15.800] like, if they're working on debugging and tracing, that's also, like, a huge problem, like, open a [20:15.800 --> 20:24.120] space for capability-based systems, how we do it securely, how we basically do it properly. Now, [20:24.120 --> 20:28.680] you have these options that, for example, like, instead of the, you can define, for example, [20:28.680 --> 20:36.280] secure domains for, like, giving some capabilities to, for secure debugging. And then, for example, [20:36.280 --> 20:45.480] like, you don't need to be worried about, like, shutting down security, like, security, for example, [20:45.480 --> 20:51.560] secure boot or, like, security features of your system just to do debugging. Because, as you know, [20:51.560 --> 21:02.840] like, it's a, by nature, it's an insecure property. So, I'm happy to get any questions if you have. [21:02.840 --> 21:06.840] And let me know if you're interested in working on this project. [21:09.000 --> 21:10.840] Yeah, thank you for the talk.