[00:00.000 --> 00:12.120]  Yeah, hi, I'm Ludwig from SUSE. I'm a research engineer there working in the so-called future
[00:12.120 --> 00:17.640]  technologies team and today I'm presenting some crazy idea. Not because we're going
[00:17.640 --> 00:22.920]  to build a product with that, but just because we can. So, first let's take a look what's
[00:22.920 --> 00:27.880]  the difference between package-based and image-based. I mean, I'm from the package-based world,
[00:27.880 --> 00:33.040]  so I don't have too many insights into actual image-based systems like in the embedded world,
[00:33.040 --> 00:38.520]  so that's just my view. Anyway, packages are known by all of you, I guess, from your desktop
[00:38.520 --> 00:46.120]  Linux installation, so you have individual components pre-built, the vendor ships your
[00:46.120 --> 00:51.360]  packages and the client side decides which one to install, so there's some kind of dependency
[00:51.360 --> 00:56.800]  resolver that takes the components, puts them on your local system. Usually you don't need
[00:56.800 --> 01:02.160]  to reboot for that. This has advantages and disadvantages, so the advantages, if you need
[01:02.160 --> 01:07.200]  a new VIM version, you just get it and it works right away. The disadvantages, if it breaks,
[01:07.200 --> 01:14.520]  it's broken. On image-based systems, on the other hand, you get a full Linux system pre-built by
[01:14.520 --> 01:21.360]  the operating system vendor. That means it's a ready-made file system that is typically downloaded
[01:21.360 --> 01:27.960]  and dd'ed to some partition, for example, or it's a tarball that gets extracted somewhere. Installing
[01:27.960 --> 01:34.440]  that one or activating the install requires a reboot, so that's an advantage because it wouldn't
[01:34.440 --> 01:40.520]  break into intermediate states, but it just either works or doesn't. The disadvantage is that it's
[01:40.520 --> 01:44.720]  typically not extendable, at least not the original image, so you would have to have some other
[01:44.720 --> 01:53.240]  mechanism like systemd-sysext nowadays that plays some tricks to get extra layers on top. So,
[01:53.240 --> 01:59.800]  let's first take a look at the file system layout of a typical image-based system, or at least how, I think,
[01:59.800 --> 02:05.360]  systemd envisions it. So, we have the operating system in /usr for Linux systems
[02:05.360 --> 02:13.240]  after usr merge. Then the /etc partition, or the /etc volume, is on /, which is writable. The boot
[02:13.240 --> 02:18.120]  partition nowadays should be the ESP, no matter where it's mounted actually, and your data is in
[02:18.120 --> 02:25.240]  /var on a separate volume again. So, how do you do updates? Usually you have separate partitions
[02:25.240 --> 02:33.840]  for /usr, so the standard case is A/B, so two versions, and then you dd your operating system
[02:33.840 --> 02:39.920]  into one, while the other one runs, and on reboot you just switch. That's quite an easy
[02:39.920 --> 02:45.720]  technology because it's just a regular partition table. It's actually read-only if the file system
[02:45.720 --> 02:52.160]  you use is a read-only one. You could use rsync or casync to download only deltas when getting it
[02:52.160 --> 02:57.240]  from the server, and the signing is also pretty easy because it's one image, so you can put some
[02:57.240 --> 03:01.320]  GPG signature on it, and that's it, and you can even verify it afterwards because the image is
[03:01.320 --> 03:08.000]  unmodified on your partition. Disadvantages, there's no deduplication, so you always consume
[03:08.000 --> 03:14.520]  twice the space basically, or however big your /usr partition is. Even if your updates are small,
[03:14.520 --> 03:22.840]  you still need all that space. The amount is limited, so usually if you have only two partitions,
[03:22.840 --> 03:27.600]  you have two versions of the operating system, and the space is pre-allocated. Again, that could
[03:27.600 --> 03:33.760]  be an advantage because there's no surprises, the space is just there, and if it's there,
[03:33.760 --> 03:38.520]  you can put the image, period. The disadvantages, your image can't grow, and the updates are always
[03:38.520 --> 03:46.640]  of a fixed size basically. So how can we optimize that? You could actually use Btrfs to store the
[03:46.640 --> 03:53.000]  operating system, that's how MicroOS works, some more details on that later in Ignite's talk.
[03:53.000 --> 04:00.960]  Anyway, we use a subvolume for the operating system. That means you get the copy-on-write
[04:00.960 --> 04:07.520]  semantics automatically, so deduplication means your updated system does not automatically need
[04:07.520 --> 04:13.320]  twice the space, but only the changed amount. You can still use rsync or casync to only apply
[04:13.320 --> 04:19.000]  deltas. The amount of versions you can store is basically infinite, only depends on how big your
[04:19.000 --> 04:26.160]  updates are. So if you only have updates on text files, it could be a lot of versions. Disadvantages,
[04:26.160 --> 04:32.120]  it's not really read-only, it's just a Btrfs flag, and that can be changed of course. Also,
[04:32.120 --> 04:37.560]  I put a question mark on verification. So previously with an image, we could just run GPG for example,
[04:37.560 --> 04:43.520]  and you can verify whether the image was modified or not. Here we have to take a look how to solve
[04:43.520 --> 04:50.160]  that later. But how do distributions actually build those images? At least an operating system
[04:50.160 --> 04:56.480]  vendor like openSUSE would use packages to build the image, just on server side. We learned that
[04:56.480 --> 05:02.200]  we can even use this technology for building any of these, and the way the image would be shipped
[05:02.200 --> 05:06.800]  would be just install packages somewhere on a server system and then throw away everything that is
[05:06.800 --> 05:14.080]  not in /usr. That means all the scriptlets that run in packages are just modifying something that's
[05:14.080 --> 05:19.600]  not relevant. Like in /etc and /var, it doesn't make any sense to have a scriptlet doing something
[05:19.600 --> 05:24.800]  there. So when, for example, a package needs to add a user, it can't just call useradd. It needs to
[05:24.800 --> 05:30.600]  use systemd-sysusers. Same if you want to enable a service, you can't just call systemctl enable,
[05:30.600 --> 05:37.600]  you need to ship presets. And that way packages also can't just put the kernel in /boot
[05:37.600 --> 05:42.320]  anymore because that's not in /usr. That means there needs to be some extra tool that somehow
[05:42.320 --> 05:52.080]  makes a system bootable when there's a new kernel, for example. So back to verification. Actually
[05:52.080 --> 06:00.560]  packages, at least RPM packages, have signed headers, and the headers have checksums for each
[06:00.560 --> 06:11.200]  file. So in the end, an image is a list of RPMs with signed headers, and by verifying each header,
[06:11.200 --> 06:16.920]  you can also verify each file. So in the end, you have a tree that could be verified, you can
[06:16.920 --> 06:22.320]  check that there's no file added, no file removed, and no file modified, just by looking at the
[06:22.320 --> 06:29.320]  RPM headers. The disadvantage is that in images that you ship, you typically remove the RPM database
[06:29.320 --> 06:34.360]  because it's this ugly binary blob. Even if it's a SQLite database, it's still an ugly binary
[06:34.360 --> 06:40.480]  blob with more binary blobs in there. So that's why people really hate having the RPM database
[06:40.480 --> 06:46.280]  there. This is something that nobody wants to see in an image. So how can we fix that? We could
[06:46.280 --> 06:54.200]  actually store the RPM headers as files. So we just dump the header part of the package into a
[06:54.200 --> 07:00.400]  directory. That means the directory is the RPM database, which looks much less ugly. You could diff
[07:00.400 --> 07:07.400]  two directory listings to actually see which packages got updated. That is quite useful
[07:07.400 --> 07:13.240]  if you already use MicroOS, for example, and do some snapper compare, then it tells you this
[07:13.240 --> 07:19.040]  RPM database and this RPM database change, but you don't know what. If the database is a listing
[07:19.040 --> 07:26.720]  of files, it's naturally visible what changed. And still we have the RPM headers, so /usr is fully
[07:26.720 --> 07:35.040]  verifiable because they're signed. So the question now is, what do we actually need an image for?
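Editor's sketch of the header-based verification described above: checking that no file was added, removed, or modified. This is a minimal illustration, not the prototype's code. It assumes the per-file checksums have already been extracted from signature-checked RPM headers into a plain dictionary; the header parsing and the signature check are not shown, and the names `verify_tree` and `expected` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tree(root, expected):
    """Compare a directory tree against expected per-file checksums.

    `expected` maps a path relative to `root` to a hex SHA-256 digest.
    It stands in for the union of the file lists of all RPM headers
    (assumption: those headers were verified beforehand).
    """
    found = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    wanted = set(expected)
    return {
        "added": sorted(found - wanted),
        "removed": sorted(wanted - found),
        "modified": sorted(
            name for name in found & wanted
            if sha256_of(root / name) != expected[name]
        ),
    }
```

An empty result in all three lists would mean the tree matches what the headers describe. Timestamps and other inode metadata are deliberately ignored here, which is the trade-off raised later in the Q&A.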
[07:35.040 --> 07:41.800]  So you don't need to take those RPMs, put them into some file system, and then download the whole
[07:41.800 --> 07:49.160]  file system or this image or tarball. You could actually define an image as a text file that lists
[07:49.160 --> 07:57.520]  RPMs. And then you download those RPMs, which are bits of your image, and just put them into this
[07:57.520 --> 08:05.640]  file system or partition that is /usr. A disadvantage of this method again would be that with today's
[08:05.640 --> 08:11.280]  RPMs you would lose the ability to do easy deltas because the payload is actually a cpio archive that is
[08:11.280 --> 08:17.000]  compressed with some compression. So if you don't want to use delta RPMs or other fancy things,
[08:17.000 --> 08:23.040]  you need to find a solution for the payload of RPMs. So the payload doesn't actually have to be a
[08:23.040 --> 08:29.560]  cpio archive that is compressed. It could be actually completely uncompressed. So you just concatenate
[08:29.560 --> 08:35.120]  all the file contents at the end of the RPM header. And because RPM header contains also the file
[08:35.120 --> 08:45.240]  sizes, you know which file is where. Now if we do another trick and align the file data to
[08:45.240 --> 08:51.960]  page size, you actually get reflinkable packages. That means you could download this uncompressed
[08:51.960 --> 08:57.120]  RPM, for example by means of casync, which would compress it actually on the server side. Then
[08:57.120 --> 09:03.280]  you have the RPM on disk. And then instead of copying the payload to some other location,
[09:03.280 --> 09:10.720]  you just use reflinks as a file system feature that reuses the blocks. So you have this one big
[09:10.720 --> 09:17.120]  chunk of data as the RPM. And to create your actual files, you just link the data into there.
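Editor's sketch of the payload layout described above: file contents concatenated after the header, each one aligned to the page size so that a file system with reflink support can share the blocks. This only computes offsets. The 4096-byte page size, the function names, and the input format are assumptions for illustration; the actual on-disk format of the prototype or the RPM pull request is not shown in the talk.

```python
PAGE_SIZE = 4096  # assumption: typical page / block size


def align_up(offset, alignment=PAGE_SIZE):
    """Round `offset` up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


def payload_layout(header_size, file_sizes):
    """Return {name: (offset, size)} for an uncompressed, aligned payload.

    `header_size` is the length of the RPM header in bytes and
    `file_sizes` is a list of (name, size) pairs in header order.
    Every file starts on a page boundary, so its blocks could be
    shared with the extracted file instead of being copied.
    """
    layout = {}
    offset = align_up(header_size)
    for name, size in file_sizes:
        layout[name] = (offset, size)
        offset = align_up(offset + size)
    return layout
```

For example, with a 5000-byte header, `payload_layout(5000, [("a", 10), ("b", 4096)])` places `a` at offset 8192 and `b` at offset 12288. On Linux, the sharing step itself would go through the file system's clone-range interface rather than a copy; that part is left out of this sketch.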
[09:17.120 --> 09:25.240]  That means /usr/bin/bash is not actually a copy, it's just sharing the same data that is in this
[09:25.240 --> 09:30.160]  RPM that is stored there, which conveniently is at the same time your RPM database. So it's not
[09:30.160 --> 09:37.120]  just the headers that are in this directory, it is the full package. So in the end, in this example,
[09:37.120 --> 09:45.480]  I put /usr/lib/sysimage/packages. /usr/lib/sysimage/packages would be your image. That means /usr is just
[09:45.480 --> 09:51.560]  a view. And if you would map this into Btrfs, like in this example, that means you have several
[09:51.560 --> 09:56.880]  versions of this image directory as a snapshot. And then you could create other Btrfs volumes
[09:56.880 --> 10:05.600]  that just link into those RPM headers. Or you could even omit this /usr completely, a colleague
[10:05.600 --> 10:11.000]  of mine, Fabian, even wrote a FUSE plugin that just creates a file system from looking at the
[10:11.000 --> 10:19.880]  RPMs in the directory. Quite crazy. So to summarize, we could build an image-like system by
[10:19.880 --> 10:26.400]  leveraging Btrfs. So instead of using A/B partitions, we just use snapshots. The behavior
[10:26.400 --> 10:30.680]  would be exactly the same because you prepare this new snapshot, put all your data in there,
[10:30.680 --> 10:35.920]  and then you have to reboot to activate it. But since it's still packages, you retain the
[10:35.920 --> 10:41.240]  flexibility to actually change it on client side. You could ship the image as a list of RPMs in a
[10:41.240 --> 10:46.480]  text file, but you could also add RPMs locally in this directory, and you have them installed.
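Editor's sketch of the "image as a text file that lists RPMs" idea summarized above. It compares such a list with the packages already present in the local package directory and reports what would have to be fetched. The manifest format (one package file name per line, `#` starting a comment) and all names here are hypothetical; the talk does not specify a format.

```python
def parse_manifest(text):
    """Parse a hypothetical image manifest: one RPM file name per line."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            names.add(line)
    return names


def plan_update(manifest_text, local_packages):
    """Work out what to download, what to keep, and what is extra.

    `local_packages` is an iterable of RPM file names already present
    in the package directory. Entries in "extra" are not part of the
    vendor image, for example packages the user added locally.
    """
    wanted = parse_manifest(manifest_text)
    local = set(local_packages)
    return {
        "download": sorted(wanted - local),
        "keep": sorted(wanted & local),
        "extra": sorted(local - wanted),
    }
```

Only the packages in `download` would need to be transferred for a new image version, which is where the delta behaviour comes from at package granularity. Whether `extra` packages are kept or removed is a policy decision this sketch leaves open.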
[10:46.480 --> 10:52.640]  So best of both worlds. And this is not just completely crazy in my head. I actually built
[10:52.640 --> 10:59.800]  a prototype that kind of works. So it uses BusyBox because it was easy to modify the RPM
[10:59.800 --> 11:04.560]  implementation in there, to work with those reflinkable packages. It uses rsync for
[11:04.560 --> 11:11.520]  updates, and it uses systemd's kernel-install to make this system actually bootable. So you
[11:11.520 --> 11:17.800]  can try it out. There's also a pull request open for RPM, I think, to have this reflinkable
[11:17.800 --> 11:22.480]  stuff, and I sent a patch to BusyBox, but I don't expect it to be accepted. It's just a proof of
[11:22.480 --> 11:27.680]  concept. Of course, to make this work in practice, there's lots of to-do's. So in existing distros
[11:27.680 --> 11:32.760]  like openSUSE, we have to fix all the packages to no longer use scriptlets. We need to talk about
[11:32.760 --> 11:37.160]  the Btrfs subvolumes. The naming should be standardized. There was actually a discussion
[11:37.160 --> 11:43.040]  like two years ago already on the list. RPM reflink payload would be nice to have upstream.
[11:43.040 --> 11:48.880]  There are other stakeholders that also would like to have that. I'm already working on systemd's
[11:48.880 --> 11:56.960]  kernel-install to make it usable for this use case. In case of MicroOS-like systems,
[11:56.960 --> 12:02.960]  we want to have rollbacks of / and just the operating system. For installing deltas, I would
[12:02.960 --> 12:08.520]  like to use casync, which we revisited. And last but not least, all of that should be native
[12:08.520 --> 12:16.240]  in RPM or zypper, and in my opinion, not just some extra tool. And that's it already. So any questions
[12:16.240 --> 12:29.000]  for this crazy stuff? Okay, me first. So how did you handle the timestamp issue? Because if you're
[12:29.000 --> 12:34.800]  using Btrfs and doing an rpm -i, you get the atime, mtime, and ctime in the inode. And
[12:34.800 --> 12:41.040]  an image-based system is supposed to hash end to end the same. And if it has different timestamps,
[12:41.040 --> 12:46.040]  it doesn't hash end to end. Yeah, well, you can't modify the timestamps. You can't modify. You can
[12:46.040 --> 12:52.120]  only look at the ones that are under your control. But then doesn't that mean that effectively we
[12:52.120 --> 12:58.080]  can't use this to reconstruct image-based systems? Well, it's not a bit-to-bit identical. What is
[12:58.080 --> 13:03.200]  on disk is not bit-to-bit identical, of course. Only the actual RPM payload. But you know that
[13:03.200 --> 13:11.120]  the payload that is linked is actually the same. So I don't think, I mean, maybe there's a use case
[13:11.120 --> 13:16.320]  why the timestamps need to be exact. But in my view, it's not important because the, like,
[13:16.320 --> 13:22.960]  /usr/bin/bash is /usr/bin/bash, no matter what the timestamp is. Well, the use case is just for
[13:22.960 --> 13:28.320]  image-based systems. The end-to-end hash tells you that you've done the right thing. And it's
[13:28.320 --> 13:33.440]  simple to compute. With your system, you basically have to do a tree hash down all the packages
[13:33.440 --> 13:38.400]  to prove that this is what you're supposed to have installed. It's, semantically, they are
[13:38.400 --> 13:42.720]  equivalent. It's just the latter is more difficult to do than the former, which is why people like
[13:42.720 --> 13:50.600]  image-based. I mean, you only need to hash the directory with all the RPMs in them, right? So
[13:50.600 --> 13:56.160]  you have the checksums of all the RPMs, and then you can verify the RPMs. Yeah,
[13:56.160 --> 13:59.840]  I'm not disputing. Exactly. You're just saying it's more complicated. Yeah, it's, of course,
[13:59.840 --> 14:04.000]  there's always a trade-off. Of course, yeah, yeah. And it depends on how you construct this /usr view.
[14:04.000 --> 14:07.840]  I mean, if it's really Btrfs, a real file system tree, then you have to walk it. But if
[14:07.840 --> 14:12.640]  it's only a view, like with the FUSE thingy, then it doesn't actually exist on disk. It's just,
[14:13.280 --> 14:17.600]  you know, looking at the RPMs. So it cannot be modified. You don't need to walk
[14:17.600 --> 14:28.240]  the tree. You just hash the RPMs. Okay. Yeah. So how would you integrate that into what we have
[14:28.240 --> 14:35.120]  heard in the previous talks? So during boot, I specify something similar to my dm-verity root
[14:35.120 --> 14:40.800]  hash, so that I know that I'm actually booting from an unmodified rootfs. That's a good question.
[14:40.800 --> 14:48.080]  That problem is not solved yet. Yeah. So far, the challenges are already at the point, how do you
[14:48.080 --> 14:55.440]  actually specify which version of the /usr tree you want to boot. But now all the models assume
[14:55.440 --> 14:59.600]  that if you ship a new image, like a new /usr version, it also comes with a new kernel and a
[14:59.600 --> 15:07.120]  new initrd. So this initrd knows what disk to boot. But in the Btrfs use case, there is
[15:07.120 --> 15:13.840]  a kernel that can have a number of initrds, and those initrds match with a number of snapshots
[15:13.840 --> 15:19.760]  that they can actually boot. So then I'm already struggling with this part. So the verification
[15:19.760 --> 15:34.480]  comes later. So dm-verity gives you authenticity of the blocks at runtime. So the device cannot
[15:34.480 --> 15:41.280]  switch them underneath. And I guess that if you verify the kernel headers, the RPM headers,
[15:43.600 --> 15:48.800]  I don't know, when loading the image, this wouldn't give you the same runtime properties.
[15:50.320 --> 15:55.680]  I haven't played with those technologies yet, to be honest. So another interesting thing would be
[15:55.680 --> 16:03.760]  this fapolicyd daemon. I only read a bit about it; it uses the audit subsystem to actually block access
[16:03.760 --> 16:09.360]  to modified files by comparing them with the information in the RPM header. So
[16:10.000 --> 16:16.160]  it would be another area to just explore how to integrate some verification technologies into
[16:16.160 --> 16:35.680]  this model. Any more questions? If not, then I guess we'll wrap up this talk. Thank you very much.