[00:00.000 --> 00:12.520]  Our next talk is by Stefan, he's the project leader for LexD, a container manager, a former
[00:12.520 --> 00:17.160]  teammate of mine as well, and he's going to talk about safe containers through system
[00:17.160 --> 00:18.160]  call interception.
[00:18.160 --> 00:19.160]  Hello.
[00:19.160 --> 00:20.160]  It starts working well.
[00:20.160 --> 00:21.160]  Thanks, sir.
[00:21.160 --> 00:22.160]  All right.
[00:22.160 --> 00:24.160]  So, you can edit the intro.
[00:24.160 --> 00:25.160]  I'm Stefan.
[00:25.160 --> 00:26.160]  I work at County Call.
[00:26.160 --> 00:32.760]  I'm the project leader for LexD, LexEFS, and a bunch of other stuff that we do, effectively
[00:32.760 --> 00:33.760]  system container guide.
[00:33.760 --> 00:37.960]  And, yeah, we're going to be talking about system call interception today.
[00:37.960 --> 00:42.880]  First, just a tiny bit of going back to the basics.
[00:42.880 --> 00:47.120]  We can't need to explain what we're trying to achieve.
[00:47.120 --> 00:49.280]  So there are two main kind of containers out there.
[00:49.280 --> 00:52.480]  We've got privileged containers and unprivileged containers.
[00:52.480 --> 00:55.520]  The ones you want are the unprivileged guide.
[00:55.520 --> 01:01.600]  Privileged is bad, and just to clarify there, too, we don't mean privileged as in dash dash
[01:01.600 --> 01:02.880]  privileged in Docker.
[01:02.880 --> 01:05.040]  That's extra, extra bad.
[01:05.040 --> 01:08.680]  Docker by default is privileged, and the definition of privileged is whether you're using a user
[01:08.680 --> 01:11.080]  namespace or not.
[01:11.080 --> 01:15.520]  So in the case of LexD, which is a container manager that I'm working on these days, with
[01:15.520 --> 01:18.240]  default to unprivileged containers, that's great.
[01:18.240 --> 01:20.880]  It means that root in the container is not root on the host.
[01:20.880 --> 01:25.040]  If there's a container escape of any kind, you don't, you get as much permission as a
[01:25.040 --> 01:28.320]  nobody user on the system, that's great.
[01:28.320 --> 01:35.120]  Problem is, not being real root also means you don't get to do stuff that real root can
[01:35.120 --> 01:36.120]  do.
[01:36.120 --> 01:40.360]  A lot of stuff have been enabled now inside the user namespace that you can do yourself.
[01:40.360 --> 01:41.360]  You can create network devices.
[01:41.360 --> 01:42.360]  You can reconfigure.
[01:42.360 --> 01:43.960]  A lot of stuff is great.
[01:43.960 --> 01:45.840]  But there are still things you can't do.
[01:45.840 --> 01:49.720]  You can't change your process priority to something higher than what you would be allowed
[01:49.720 --> 01:51.440]  to do as a normal user on the system.
[01:51.440 --> 01:55.280]  Otherwise, a user on the system can just create a new user namespace, go in it, and bump their
[01:55.280 --> 02:00.320]  process priority to whatever they want, and bypass all kind of settings on the system.
[02:00.320 --> 02:03.960]  So there's a lot of things that are not quite possible.
[02:03.960 --> 02:07.920]  In general, we want to eradicate privileged containers because having real root is very,
[02:07.920 --> 02:08.920]  very bad.
[02:08.920 --> 02:14.640]  And it's kind of a game of, like, welcome all, as far as trying to prevent nasty things
[02:14.640 --> 02:15.640]  from happening.
[02:15.640 --> 02:16.640]  We've got Apama.
[02:16.640 --> 02:17.640]  We've got SecComp.
[02:17.640 --> 02:20.800]  We've got a whole bunch of things that are all trying to prevent you from doing bad things.
[02:20.800 --> 02:25.560]  So that's done by us thinking about what all the bad things are and trying to block them.
[02:25.560 --> 02:28.240]  And someone just needs to find another bad thing we didn't think of.
[02:28.240 --> 02:30.080]  And then there goes the entire system.
[02:30.080 --> 02:31.080]  So we don't want those.
[02:31.080 --> 02:33.440]  We'd like to get rid of them completely.
[02:33.440 --> 02:38.400]  But for that, we need to find ways to allow for unprivileged environments to do things
[02:38.400 --> 02:43.480]  that are normally only allowed to be done by privileged environments in a way that's
[02:43.480 --> 02:44.480]  still safe.
[02:44.480 --> 02:45.480]  All right.
[02:45.480 --> 02:53.040]  So all I'm going to be talking about today relies on SecComp, which is the system call
[02:53.040 --> 02:55.880]  interception mechanism in Linux.
[02:55.880 --> 02:59.520]  It lets you do a bunch of nice policies.
[02:59.520 --> 03:03.520]  You can just put policies for, like, this system code with those arguments, just deny
[03:03.520 --> 03:11.000]  them or return this return code or return, yeah, this particular error number, for example.
[03:11.000 --> 03:14.720]  But it also grew the ability with Linux 5.0 in 2019.
[03:14.720 --> 03:19.400]  It grew the ability to notify user space instead.
[03:19.400 --> 03:26.280]  So you can put a policy in SecComp that says, if this is called and the arguments are so
[03:26.280 --> 03:33.440]  insert, instead of doing, taking action right now, go and notify this file descriptor that
[03:33.440 --> 03:34.840]  something happened.
[03:34.840 --> 03:39.920]  And then the whole system can have a privileged demon monitoring that notification mechanism
[03:39.920 --> 03:43.360]  and take actions.
[03:43.360 --> 03:47.760]  There is some complexity around security that I'm going to get into very shortly because
[03:47.760 --> 03:50.320]  you can do very, very bad things with that.
[03:50.320 --> 03:56.480]  But if you do it correctly, it lets you run a more privileged action kind of on behalf
[03:56.480 --> 04:02.080]  of a less privileged container after going through some kind of a list or that kind of
[04:02.080 --> 04:07.280]  logic on the host to make sure that this is actually fine.
[04:07.280 --> 04:15.080]  Now for the nasty issues, time of check, time of use is a very, very common issue in security.
[04:15.080 --> 04:20.400]  And this mechanism has definitely got some issues around that.
[04:20.400 --> 04:24.360]  User space gets notified that a system code was made.
[04:24.360 --> 04:30.440]  The system code can have pointers to a bunch of different arguments and structures.
[04:30.440 --> 04:34.840]  And there's nothing preventing the caller from technically changing the value at those
[04:34.840 --> 04:36.000]  pointers.
[04:36.000 --> 04:38.880]  So you need to be a bit careful when you're processing those messages.
[04:38.880 --> 04:43.360]  You effectively need to start by copying everything, evaluating it.
[04:43.360 --> 04:45.600]  If everything looks good, then you can take actions.
[04:45.600 --> 04:50.160]  But by taking actions, we mean you can run the thing on behalf of the user with the original
[04:50.160 --> 04:55.760]  arguments, never putting them again because otherwise they could change.
[04:55.760 --> 05:00.040]  Or you could just say, I don't want this reject.
[05:00.040 --> 05:05.280]  What you shouldn't do is say, oh, based on those arguments, it seems fine.
[05:05.280 --> 05:06.920]  Let it continue.
[05:06.920 --> 05:10.560]  Because there's absolutely nothing that prevents the caller process from just racing you and
[05:10.560 --> 05:14.120]  immediately changing the arguments to something else before it goes back to the kernel and
[05:14.120 --> 05:17.800]  then running with a value you would not have allowed.
[05:17.800 --> 05:21.000]  So you need to be careful in your design so that this doesn't happen.
[05:21.000 --> 05:25.960]  Otherwise you're literally allowing people to run stuff as full route privilege inside
[05:25.960 --> 05:29.720]  of imperial containers, which would be very bad.
[05:29.720 --> 05:34.200]  So what do we actually do with this stuff?
[05:34.200 --> 05:36.440]  So far we've implemented quite a few things.
[05:36.440 --> 05:40.240]  I'm going to go into more details about each of those.
[05:40.240 --> 05:44.680]  The first thing we implemented, actually, I don't know if they are in the right order.
[05:44.680 --> 05:48.440]  One of the first things we implemented is make node.
[05:48.440 --> 05:51.800]  Then we followed that, which is useful for save device nodes creation.
[05:51.800 --> 05:54.160]  I'm going to go into more detail shortly.
[05:54.160 --> 05:57.360]  Set X adder, we also added pretty early on.
[05:57.360 --> 06:02.000]  We've got support for eBPF, so we can allow some specific eBPF programs.
[06:02.000 --> 06:07.560]  We've got support for set scheduler, which is used to change some of the process priorities.
[06:07.560 --> 06:11.840]  We've got support for mount, which was a real pain in the ass to implement, but we've got
[06:11.840 --> 06:12.840]  support for mount.
[06:12.840 --> 06:19.800]  And we've got support for this info, which was also reasonably fun to implement.
[06:19.800 --> 06:26.720]  Now kind of going over those things directly, make node, what do we use that for?
[06:26.720 --> 06:32.560]  One of the things we wanted to enable is for running Docker inside of lexd containers.
[06:32.560 --> 06:35.640]  As I said, lexd containers are on privilege, they are nice and safe.
[06:35.640 --> 06:38.720]  Docker, by and large, not safe.
[06:38.720 --> 06:42.560]  But Docker running inside of a privileged lexd container safe.
[06:42.560 --> 06:44.800]  So we figured we'd try and make that work.
[06:44.800 --> 06:47.160]  And we did manage to get it working.
[06:47.160 --> 06:50.600]  The main driver at the time was Travis CI.
[06:50.600 --> 06:57.800]  The Travis CI platform was using lexd containers on m64, IBM Z series, as well as IBM power
[06:57.800 --> 06:59.320]  at the time.
[06:59.320 --> 07:03.160]  And they wanted the same behavior as they were getting on Intel.
[07:03.160 --> 07:08.160]  And on Intel, they were using full VMs that you could do whatever you wanted them.
[07:08.160 --> 07:10.600]  So we wanted to make sure that Docker worked properly in there.
[07:10.600 --> 07:15.400]  And what we noticed is that Docker layers, especially the directory white out files and
[07:15.400 --> 07:24.200]  five white out files, rely on either, I think it's C00 device nodes, or they rely on specific
[07:24.200 --> 07:29.200]  excellent attributes that just say that this is like a directory that was removed, effectively
[07:29.200 --> 07:32.560]  in the underlay, using that as white out.
[07:32.560 --> 07:34.400]  Both cases, those things were not allowed.
[07:34.400 --> 07:37.640]  Device creation in a container is a big, big, big no go usually.
[07:37.640 --> 07:42.440]  Because if you can create, say, the device node for dev SDA, then there's nothing preventing
[07:42.440 --> 07:46.520]  your own previous container from rewriting your disk, which would be bad.
[07:46.520 --> 07:48.160]  So that's usually not allowed.
[07:48.160 --> 07:50.760]  But some specific device nodes are fine.
[07:50.760 --> 07:55.600]  And the work we did there also allows for things like creating a new dev null device
[07:55.600 --> 08:01.160]  or new dev zero device or those kind of devices which are inherently safe.
[08:01.160 --> 08:05.000]  And making that possible means that you can now do things like running the bootstrap or
[08:05.000 --> 08:09.160]  similar tools inside of an underprivileged container, because the few devices that are
[08:09.160 --> 08:15.640]  needed to be created as part of an image creation process are safe devices and this allows it.
[08:15.640 --> 08:20.040]  We generally consider this particular interception to be safe, as in you can pretty much turn
[08:20.040 --> 08:25.800]  it on on any complexity container without having to think too much about who's in that
[08:25.800 --> 08:30.240]  container, like do we actually trust the workloads on that kind of stuff.
[08:30.240 --> 08:33.720]  The other piece to that puzzle was set XSATA.
[08:33.720 --> 08:36.920]  As mentioned, same deal with Docker and the white out files.
[08:36.920 --> 08:39.040]  We needed to implement that one.
[08:39.040 --> 08:41.080]  Similarly, it does not allow all of the XSATA.
[08:41.080 --> 08:46.560]  It just allows very safe namespaces of XSATA attributes.
[08:46.560 --> 08:50.520]  It will not let you do things like setting a security XSATA attribute, because that would
[08:50.520 --> 08:53.520]  let you do some really, really bad things, for example.
[08:53.520 --> 08:57.920]  And this is similarly considered to be safe on our side.
[08:57.920 --> 09:02.760]  Then got the pretty interesting one, mount.
[09:02.760 --> 09:09.960]  Also, again, mount is a bit of a problem, because usually, well, first of all, usually
[09:09.960 --> 09:11.560]  it relies on having a block device.
[09:11.560 --> 09:16.160]  You kind of need to have that allowed in the container, which is already a bit fishy in
[09:16.160 --> 09:17.160]  many cases.
[09:17.160 --> 09:21.480]  You've got to be careful that any block device exposed to the container, you consider as
[09:21.480 --> 09:25.760]  entrusted from that portal and you never mount it as real roots somewhere else, or they could
[09:25.760 --> 09:27.320]  try and attack you.
[09:27.320 --> 09:33.480]  And by attack you, what I mean is the kernel has a super block parser that will process
[09:33.480 --> 09:36.160]  a new device as it gets mounted.
[09:36.160 --> 09:38.600]  And this is not guaranteed to be bug free.
[09:38.600 --> 09:45.240]  So a user that can craft a very specific block device might be able to trick something
[09:45.240 --> 09:49.880]  like X4, XFS, BarFS, or any of the other file systems into either crashing the entire
[09:49.880 --> 09:53.040]  system or doing arbitrary code execution in the kernel.
[09:53.040 --> 09:55.800]  Both cases not very good.
[09:55.800 --> 09:59.720]  But we still enable, so we still enable support for that.
[09:59.720 --> 10:05.200]  If you have a container that you trust, that you don't want to give full access for everything,
[10:05.200 --> 10:15.320]  but that you still trust, you can technically do this, and it will let you mount inside
[10:15.320 --> 10:16.560]  of the container.
[10:16.560 --> 10:21.240]  We added an extra layer on top of that, which lets us do a shift, because if you do the
[10:21.240 --> 10:26.600]  amounts, the ownership information on that device are probably not landing at the container.
[10:26.600 --> 10:31.440]  So we allow stacking shift effects, which is a fact-send that's hopefully dying soon,
[10:31.440 --> 10:36.600]  but that we implemented for Ubuntu, that we can stack on top and that fixes the permissions.
[10:36.600 --> 10:38.240]  So we support that as well.
[10:38.240 --> 10:42.920]  The really cool thing with this stuff is we also support redirecting to Fuse instead,
[10:42.920 --> 10:44.600]  which then becomes safe.
[10:44.600 --> 10:49.000]  So what we can do is say, Ernie attempt at mounting EXT4 inside of the container called
[10:49.000 --> 10:51.840]  defuse 2FS binary instead.
[10:51.840 --> 10:53.160]  And yeah, that's safe.
[10:53.160 --> 10:56.520]  That actually does work pretty well.
[10:56.520 --> 10:59.280]  And I'll show that in just a tiny bit.
[10:59.280 --> 11:04.080]  And then we worked on the BPF, not allowing all of the BPF programs, but specifically
[11:04.080 --> 11:10.040]  allowing those that we need to do nested containers and doing device permission management throughout.
[11:10.040 --> 11:14.160]  So we can review what the program is, if it matches what we expect, then we load it, otherwise
[11:14.160 --> 11:18.560]  we just reject entirely, and that's also considered to be safe.
[11:18.560 --> 11:23.480]  Then for an unsafe one, SCAD set scheduler is not super-duper safe, because it lets you
[11:23.480 --> 11:25.480]  reconfigure scheduler options.
[11:25.480 --> 11:29.680]  It was needed to be able to run Android inside of an entry-registerly container.
[11:29.680 --> 11:32.280]  They were doing some slightly wonky stuff on startup.
[11:32.280 --> 11:34.320]  That needed that.
[11:34.320 --> 11:38.440]  But we know that effectively the container could make itself unkillable, for example,
[11:38.440 --> 11:42.720]  or could raise its priority enough to slow down the rest of the system.
[11:42.720 --> 11:43.720]  So it's something to keep in mind.
[11:43.720 --> 11:48.720]  There's no way to escape that we're aware of by enabling this thing, but there's definitely
[11:48.720 --> 11:52.520]  some ability to affect the entire system.
[11:52.520 --> 11:53.520]  And then we had this info.
[11:53.520 --> 11:59.320]  That was kind of led by Alpine deciding not to use Proc Mem Info to figure out the memory
[11:59.320 --> 12:01.640]  usage in the free command.
[12:01.640 --> 12:05.320]  So you would run a container with a limit of, like, a gig of RAM and run free, and you
[12:05.320 --> 12:09.200]  would see 128 gigs, because it would just show you the host value directly.
[12:09.200 --> 12:13.640]  And that's because you look like CFS, which is another project we run, that overlays on
[12:13.640 --> 12:15.840]  top of Proc to show the right values.
[12:15.840 --> 12:18.160]  They don't work, because they were not reading the file system.
[12:18.160 --> 12:22.240]  They were going straight to the kernel with a system call using CS Info.
[12:22.240 --> 12:25.680]  So we've implemented Interception for that, and we've filled it with the same values as
[12:25.680 --> 12:31.520]  you would be getting from, like, CFS, and that gets us the right behavior.
[12:31.520 --> 12:38.920]  And so just switching to the demo, I'm also just rechecking something here real quick.
[12:38.920 --> 12:43.920]  Okay, I was just making sure that Christian was wrong with the time.
[12:43.920 --> 12:44.920]  That's good.
[12:44.920 --> 12:45.920]  I've got until 3.
[12:45.920 --> 12:49.920]  You showed me at 10 minutes.
[12:49.920 --> 12:51.760]  All right.
[12:51.760 --> 12:57.040]  So let's just move here to the terminal.
[12:57.040 --> 13:03.960]  And the first thing I'll do is play with Makenode, so, and that should be, yep, that's all right.
[13:03.960 --> 13:17.200]  It's launching a new container, Ubuntu 22.04, and let's try Makenode, I'm sure I'm getting
[13:17.200 --> 13:22.840]  those wrong again, because I always get them wrong, Makenode.
[13:22.840 --> 13:23.840]  Depending on the kernel.
[13:23.840 --> 13:24.840]  Makenode name, then type.
[13:24.840 --> 13:27.040]  Depending on the kernel, this might work.
[13:27.040 --> 13:30.960]  Yeah, it actually does work now.
[13:30.960 --> 13:35.440]  Let me figure out DevNode, this one shouldn't work, 1, 3.
[13:35.440 --> 13:36.440]  C5, 1.
[13:36.440 --> 13:40.920]  Yeah, C1, 3, for example, for DevNode does not work out of the box.
[13:40.920 --> 13:58.760]  But if we stop this, then set demo, makenode, security, calls, intercept, makenode, true.
[13:58.760 --> 14:01.800]  In this case, we do need to restart the container, because the entire second policy needs to
[14:01.800 --> 14:02.800]  change.
[14:02.800 --> 14:05.520]  For smaller changes, we don't need to, but for that we do.
[14:05.520 --> 14:06.520]  And now that works properly.
[14:06.520 --> 14:11.640]  So that's the Makenode piece.
[14:11.640 --> 14:12.760]  Then we've got Docker.
[14:12.760 --> 14:17.640]  For that one, I did prepare a tiny bit, because I did not feel like downloading Docker on
[14:17.640 --> 14:18.640]  the Wi-Fi here.
[14:18.640 --> 14:23.880]  I mean, actually, I did, but it took an hour, so happy I didn't do it during the talk.
[14:23.880 --> 14:27.040]  So for Docker, actually, let me show you the config first.
[14:27.040 --> 14:31.520]  So the container here has security nesting enabled, which allows for running containers
[14:31.520 --> 14:32.520]  in containers.
[14:32.520 --> 14:38.160]  And it has intercept of both Makenode and setExider that are set up.
[14:38.160 --> 14:45.400]  And in there, that part does use the network, so I'm just hoping that it's tiny enough.
[14:45.400 --> 14:46.600]  There we go.
[14:46.600 --> 14:48.280]  So that works properly.
[14:48.280 --> 14:52.080]  And the issue before was that unpacking the layers would just blow up.
[14:52.080 --> 14:53.840]  All right.
[14:53.840 --> 15:02.520]  So that was Docker, then to the more, to the fancier one, which is mount.
[15:02.520 --> 15:05.240]  All right.
[15:05.240 --> 15:08.360]  So for mount, launch the container here.
[15:08.360 --> 15:10.880]  I'm going to go and pass it a block device.
[15:10.880 --> 15:15.800]  So I'm passing it dev loop 11 on my system as dev SDA inside of the container.
[15:15.800 --> 15:19.200]  Yeah, your signs are still wrong.
[15:19.200 --> 15:22.200]  I've got 15.
[15:22.200 --> 15:25.720]  I mean, it's until three.
[15:25.720 --> 15:26.960]  Okay.
[15:26.960 --> 15:38.480]  So demo mount, make FS, EXT4 on the SDA, yes.
[15:38.480 --> 15:41.720]  So just formatting, that you can always do, like there's nothing preventing you from creating
[15:41.720 --> 15:42.720]  a file system.
[15:42.720 --> 15:44.240]  Normally, that works just fine.
[15:44.240 --> 15:49.680]  What doesn't work is this, like you're not allowed to actually mount it inside of a container.
[15:49.680 --> 15:52.640]  But now we can make that work, actually.
[15:52.640 --> 16:00.560]  So what we're going to be doing here is turning on mount interception.
[16:00.560 --> 16:05.480]  And then we need to set an extra one, which is the one allowing specific file system.
[16:05.480 --> 16:08.040]  So in this case, EXT4.
[16:08.040 --> 16:12.160]  And then restarting the container here.
[16:12.160 --> 16:13.160]  Okay.
[16:13.160 --> 16:19.160]  Exit back in there and try mounting again, and that works.
[16:19.160 --> 16:26.640]  And if we look at this, and we look at DF, it's mounted normally, it's fine.
[16:26.640 --> 16:30.480]  Other than it actually did this as your route, and I could have done very nasty things by
[16:30.480 --> 16:33.040]  crafting a particular device ahead of time.
[16:33.040 --> 16:35.880]  It works as expected.
[16:35.880 --> 16:41.840]  Now, what we can do to make things a bit more interesting here, actually, we did back in
[16:41.840 --> 16:48.440]  there, let's unmount it, and then install fuse to FS.
[16:48.440 --> 16:53.640]  That's a fuse implementation of EXT234.
[16:53.640 --> 16:55.520]  That's pretty readily available.
[16:55.520 --> 16:57.320]  You can install that.
[16:57.320 --> 17:01.840]  And then what we need to do is remove the config key that allowed straight up mounting
[17:01.840 --> 17:02.840]  of EXT4.
[17:02.840 --> 17:09.200]  And we replace it with another config key that instead says EXT4 is fuse to FS.
[17:09.200 --> 17:17.960]  So we put that in there, go back in, and then do SDA back to M&T.
[17:17.960 --> 17:22.400]  And the funny thing is that you won't actually notice any difference whatsoever.
[17:22.400 --> 17:28.320]  It actually looks completely identical unless you go look at proc mounts at which point
[17:28.320 --> 17:34.160]  you're going to notice that the file system here is not EXT4, it's fuse.exe4.
[17:34.160 --> 17:36.760]  And if you look at the process list, you're going to notice that there's an extra process
[17:36.760 --> 17:39.840]  running in your container now.
[17:39.840 --> 17:44.880]  So that's pretty sweet, because it means we can literally forward any file system to
[17:44.880 --> 17:50.960]  fuse, because it's done at the Cisco layer, the container doesn't really need to be aware
[17:50.960 --> 17:51.960]  of it.
[17:51.960 --> 17:53.600]  Like it is not doing anything to the mount command.
[17:53.600 --> 17:57.600]  You can just call the mount Cisco at any point, and it will just forward it to fuse
[17:57.600 --> 18:01.440]  and do the right thing for you, which means no chance you work yours whatsoever.
[18:01.440 --> 18:03.320]  It just works.
[18:03.320 --> 18:06.320]  So that is pretty cool.
[18:06.320 --> 18:14.240]  And the last thing I wanted to show in the demo is for launch a Alpine Edge container.
[18:14.240 --> 18:16.480]  So that's going to be the most info.
[18:16.480 --> 18:20.160]  And I'm going to set a memory limit of one gig.
[18:20.160 --> 18:29.680]  So if I go in there now, and I look at the free memory, we can see I've got 16 gigs,
[18:29.680 --> 18:32.240]  which is considerably more than one gig.
[18:32.240 --> 18:34.400]  The enforcement is in place.
[18:34.400 --> 18:37.600]  So that's where problems happen, is that the enforcement is in place.
[18:37.600 --> 18:42.040]  Now if you run something that will look at the free memory output to figure out how much
[18:42.040 --> 18:45.920]  memory it can claim, it's going to claim the wrong amount of memory, and it will get
[18:45.920 --> 18:48.200]  killed by the out-of-memory killer.
[18:48.200 --> 18:55.280]  So that's a problem, which is why we did the work to fix that shoe system call interception.
[18:55.280 --> 19:02.560]  So you can do security, syscall, intercept, sys, info, sure.
[19:02.560 --> 19:19.960]  And then bounce that container.
[19:19.960 --> 19:22.800]  And if I go in there and I look at free, now we've got a gig.
[19:22.800 --> 19:24.840]  So that actually works properly.
[19:24.840 --> 19:26.560]  It also fixes a bunch of other things.
[19:26.560 --> 19:31.000]  It doesn't just do the memory, but it also does CPU load, uptime, and a bunch of other
[19:31.000 --> 19:32.000]  things.
[19:32.000 --> 19:34.640]  So that's how StrapSysInfo is now properly handled.
[19:34.640 --> 19:38.920]  It's just the easiest use case we have is free on uptime, because we know that this
[19:38.920 --> 19:43.080]  command has been changed to use sysinfo instead of the five system, so it's a very easy one
[19:43.080 --> 19:45.600]  to just prove the concept works.
[19:45.600 --> 19:48.320]  But that particular piece of work fixed a lot of other things.
[19:48.320 --> 19:52.960]  I think it also improved Java, was using the wrong interface and the wrong amount of memory
[19:52.960 --> 19:54.640]  sometimes, so that fixed that.
[19:54.640 --> 19:55.840]  It fixed a bunch of other stuff.
[19:55.840 --> 19:58.840]  That was pretty good to have.
[19:58.840 --> 20:08.960]  So looking forward, where do we want to take this?
[20:08.960 --> 20:13.320]  We've got most of what we wanted covered, really.
[20:13.320 --> 20:18.560]  The big items that are really problematic have been resolved.
[20:18.560 --> 20:23.520]  Docker was a big one that we really wanted to solve, and it's working fine.
[20:23.520 --> 20:25.280]  Alpine behaving, that's really nice.
[20:25.280 --> 20:26.280]  We're happy with it.
[20:26.280 --> 20:29.840]  The monitor section allows for a lot of different stuff now.
[20:29.840 --> 20:32.840]  It's possible to do things like image building instead of an previous container.
[20:32.840 --> 20:35.200]  You can really do a lot.
[20:35.200 --> 20:40.440]  And Android seems to be happy, too, with the tiny bit of stuff we added for that.
[20:40.440 --> 20:45.360]  The things we'd like to add now are kind of weird stuff, really.
[20:45.360 --> 20:49.320]  The one I've got in mind mostly for the sake of it, but also because I'm sure we found
[20:49.320 --> 20:53.880]  a use case for it, is to implement init module and finit module.
[20:53.880 --> 20:58.760]  So the ability to do kernel module loading from inside of an previous container, which
[20:58.760 --> 21:04.640]  is as terrifying as it sounds, but the idea there is that we would not actually allow
[21:04.640 --> 21:09.320]  for the container to feed us the actual object file that we would then store at the counter.
[21:09.320 --> 21:12.720]  What we would do is we would receive that object file, we'd look at what the kernel module
[21:12.720 --> 21:18.040]  name is, look against an alias that we have of kernel modules that are finitely loaded
[21:18.040 --> 21:23.720]  on the system, and if it's in that list, then we'll do the loading using the module from
[21:23.720 --> 21:26.840]  the host system that we know is correct.
[21:26.840 --> 21:30.480]  This might help quite a bit with things like firewalls in containers that might need to
[21:30.480 --> 21:35.280]  load custom IP tables or net filter modules, and potentially some other things like file
[21:35.280 --> 21:37.600]  systems and other things.
[21:37.600 --> 21:42.520]  So that would be an interesting one to implement, and I'm sure we're going to have to explain
[21:42.520 --> 21:45.240]  to a lot of people exactly how it is that we're doing it, because otherwise we're going
[21:45.240 --> 21:49.200]  to be absolutely terrified.
[21:49.200 --> 21:54.360]  Before eBPF program handling, I think it's also in our plans, as I said, we currently
[21:54.360 --> 22:01.920]  only intercept the eBPF program that's used for device management, so for allowing device
[22:01.920 --> 22:06.520]  creation, device mapping, that kind of stuff within containers.
[22:06.520 --> 22:13.000]  That's because C Group 2 removed the device's C Groups file interface and moved to eBPF
[22:13.000 --> 22:16.200]  instead, so we implemented it that way.
[22:16.200 --> 22:21.080]  We seem to have some interest for other programs, like I think SystemD and some other pretty
[22:21.080 --> 22:26.480]  common pieces of software now generate eBPF hook that they hook either globally or on
[22:26.480 --> 22:30.160]  two specific interfaces, and some of those should be saved.
[22:30.160 --> 22:34.720]  That should be things that we can effectively pull, validate, that they match the expected
[22:34.720 --> 22:40.880]  pattern, and if they do, then show that to the camera, this is fine, and that should
[22:40.880 --> 22:46.480]  actually make a lot more newer software that make use of a lot of eBPF stuff to just start
[22:46.480 --> 22:47.480]  working.
[22:47.480 --> 22:51.840]  I don't think we're anywhere near getting something like a eBPF trace working safely
[22:51.840 --> 22:52.840]  inside a container.
[22:52.840 --> 22:56.280]  That's absolutely terrifying, because it's got access to all of the candle constructs,
[22:56.280 --> 23:04.400]  and that's not something we do, but some subsets of those interfaces should definitely be fine.
[23:04.400 --> 23:11.240]  I think eBPF will solve a lot of those problems, probably, because then you can load unprivileged
[23:11.240 --> 23:13.600]  programs.
[23:13.600 --> 23:19.080]  And the other thing that I've had in mind for a while, and it's mostly a cool thing,
[23:19.080 --> 23:25.480]  not something I've actually had the use case for yet, SecComp has an interesting property
[23:25.480 --> 23:28.200]  in that it runs extremely early.
[23:28.200 --> 23:36.240]  It runs in the system called entry time in the kernel, before the system call is resolved.
[23:36.240 --> 23:39.760]  That means we can intercept system calls that don't exist.
[23:39.760 --> 23:44.480]  So we can intercept system call numbers that have not yet been allocated, and that means
[23:44.480 --> 23:50.160]  we get to actually implement new system calls purely in user space, that you can access
[23:50.160 --> 23:53.720]  through the normal kernel system call API.
[23:53.720 --> 23:58.800]  That's super interesting because it lets you do very easy prototype and testing of potential
[23:58.800 --> 23:59.800]  system calls.
[23:59.800 --> 24:03.600]  If you want to try specific interfaces, see how they look, the layout, what kind of arguments
[24:03.600 --> 24:09.440]  you want, you can pretty quickly implement system calls through that, and already show
[24:09.440 --> 24:13.400]  user space software added until you're happy with it, at which point you go back and you
[24:13.400 --> 24:16.560]  do the actual kernel implementation of the system call.
[24:16.560 --> 24:18.000]  So that might be pretty interesting.
[24:18.000 --> 24:23.160]  I don't think anyone has actually done that yet, but that's a nice property of how SecComp
[24:23.160 --> 24:28.760]  works, that it works before any kind of resolution, any kind of validity of the system call number.
[24:28.760 --> 24:32.520]  All that SecComp tells us is actually a system call number and all of the pointers through
[24:32.520 --> 24:33.920]  the arguments.
[24:33.920 --> 24:36.240]  It doesn't care whether the thing exists or not.
[24:36.240 --> 24:43.280]  So it means we get to actually intercept things that don't exist.
[24:43.280 --> 24:44.960]  And that's it.
[24:44.960 --> 24:47.560]  So we can start getting a few questions.
[24:47.560 --> 24:50.760]  Also on your way out, if you're interested, we do have legacy stickers on the table over
[24:50.760 --> 24:51.760]  there.
[24:51.760 --> 24:54.640]  If you want to help yourself, there's a question over there.
[24:54.640 --> 24:56.920]  I think this was the first one.
[24:56.920 --> 25:02.040]  Yeah, there's one here, there's one over there.
[25:02.040 --> 25:09.560]  When will we see the sysinfo system call being intercepted by default on LXD or other distributions?
[25:09.560 --> 25:10.640]  Sorry.
[25:10.640 --> 25:15.480]  When will we see sysinfo calls being default intercepted?
[25:15.480 --> 25:17.720]  This is going to roll out?
[25:17.720 --> 25:22.200]  Yeah, we've currently not decided to do any of that by default.
[25:22.200 --> 25:26.400]  Please leave quietly while we are answering questions.
[25:26.400 --> 25:31.840]  Yeah, so we've not decided to intercept anything by default yet.
[25:31.840 --> 25:33.280]  We consider it to be safe.
[25:33.280 --> 25:36.800]  The main problem we have is it depends on the kernel version that you're running, whether
[25:36.800 --> 25:38.800]  it's going to be working or not.
[25:38.800 --> 25:43.200]  And it's still recent enough, even though it's 5.1, which has been around a while now,
[25:43.200 --> 25:46.960]  it's still recent enough that a bunch of distros would not work properly.
[25:46.960 --> 25:51.040]  So we want to wait until we can generally assume that all of the distros that are like all
[25:51.040 --> 25:55.480]  the long term support releases are still supported before we can start doing that kind of stuff
[25:55.480 --> 25:57.880]  by default.
[25:57.880 --> 26:00.280]  Please keep it down while we are answering questions.
[26:00.280 --> 26:01.280]  Thank you.
[26:01.280 --> 26:02.280]  Hello.
[26:02.280 --> 26:03.960]  Thanks for the great talk.
[26:03.960 --> 26:05.400]  So I have two questions.
[26:05.400 --> 26:11.000]  First of all, you said there is this time of check versus time of use issue.
[26:11.000 --> 26:12.000]  And so how do you solve it?
[26:12.000 --> 26:16.360]  It's still trying to give a question, but I can't hear anything, hold on.
[26:16.360 --> 26:20.840]  So first of all, how do you fix this time of check versus time of use issue, where you
[26:20.840 --> 26:27.320]  know you call a syscall, the syscall gets notified, and you can, well, raise it with
[26:27.320 --> 26:30.680]  another thread and change some arguments, right?
[26:30.680 --> 26:34.280]  I didn't really get that, but if Christian did, you can answer instead, because you probably
[26:34.280 --> 26:42.280]  know it.
[26:42.280 --> 26:44.720]  It's extremely noisy.
[26:44.720 --> 26:47.600]  Now, okay, I'm going to try this.
[26:47.600 --> 26:51.480]  Stefan, how do you fix the time of check, time of use issue?
[26:51.480 --> 26:57.200]  Okay, so the time of check, time of use issue, you fix it by never letting the kernel execute
[26:57.200 --> 26:59.320]  after the check.
[26:59.320 --> 27:04.280]  So you never continue a system call after the check, effectively.
[27:04.280 --> 27:09.880]  If you want to intercept a system call, you are now in charge of running it.
[27:09.880 --> 27:14.520]  And so you copy the arguments as they are, you do the check on your copy of them.
[27:14.520 --> 27:19.560]  You never, ever reuse the pointer that the user gave you, and you go with your own copy
[27:19.560 --> 27:22.440]  of it, and that's perfectly safe.
[27:22.440 --> 27:26.200]  But if the argument is a pointer to a string, you need to copy the string, and when you
[27:26.200 --> 27:28.880]  are copying the string, it may be changed under the hood.
[27:28.880 --> 27:33.720]  So are you actually freezing the process with something like the C group, the freezer C
[27:33.720 --> 27:35.320]  group, for example?
[27:35.320 --> 27:39.920]  So technically the calling thread is frozen by the kernel, but it doesn't prevent another
[27:39.920 --> 27:44.400]  parallel thread to modify it, which is why we effectively map the memory of the process
[27:44.400 --> 27:48.200]  with the, we copy the entire thing that we care about.
[27:48.200 --> 27:52.200]  The entire, like if there's pointer of pointers, we just travel start, we copy it.
[27:52.200 --> 27:55.840]  Once we've copied that, that's what we check policy against.
[27:55.840 --> 28:00.840]  And that's what the, those are the arguments we're going to be passing to the actual kernel.
[28:00.840 --> 28:05.680]  And we just never look back at what came from the process, which means if they try to raise
[28:05.680 --> 28:07.720]  us at that point, it doesn't matter.
[28:07.720 --> 28:11.720]  We create full copies, we create full copies of everything, we never continue the system
[28:11.720 --> 28:17.280]  call, although that's an ability that I added a while back, so you can even say, continue
[28:17.280 --> 28:21.560]  the system call if I come to the conclusion that it's fine to do so.
[28:21.560 --> 28:29.240]  But if you do that, then you need to be, the kernel needs to guarantee you that it's safe.
[28:29.240 --> 28:34.280]  For example, continuing the make not system call after you inspected the arguments is
[28:34.280 --> 28:42.160]  safe because the kernel will just allow the creation of any device.
[28:42.160 --> 28:51.040]  So I have another question, because you said about MK not that if you MK not add like this
[28:51.040 --> 28:57.640]  device that nothing protects you against reading or writing into this, right?
[28:57.640 --> 29:03.760]  But there is this devices C group where you can actually protect this device from being
[29:03.760 --> 29:05.200]  written to or read from.
[29:05.200 --> 29:07.960]  And this is what Docker does, for example.
[29:07.960 --> 29:11.280]  So are you doing this in LXC?
[29:11.280 --> 29:13.960]  I'm sorry, I only get about 20% of what you're saying.
[29:13.960 --> 29:17.200]  Well of times what we'll do is that I'm going to be outside and we can just talk because
[29:17.200 --> 29:19.000]  you also have questions.
[29:19.000 --> 29:22.040]  So just follow me and we'll chat, it's going to be easier.
[29:22.040 --> 29:33.480]  Thank you very much.