Host: Our next speaker is Todd Gamblin. As a lot of people here know, I'm very much involved in the EasyBuild project, which was actually the excuse we used to start the HPC dev room. But we're also very open to other projects that are very similar to what we work on every day, so in some sense Spack is our mortal enemy, but we do allow them to give talks in the dev room as well.

Todd Gamblin: Yeah, with that, thanks. Okay, who's heard of Spack? Okay, cool, people have heard of Spack, so we don't need to do too many introductions for this talk. This is less a talk about Spack and more a talk about the CI that we've started doing since introducing binary packages in Spack. I don't think I need to tell people why they need Spack for HPC; lots of folks have talked about that already today.

Harmen mentioned deployment, so I'm supposed to talk a little bit about that. To deploy Spack, if you want to try it on a new system, just clone it from the Git repo and run it. All you need is Python and a few other tools on your system, so you can run it straight out of the repo if you want to play around with it and build things right there.

Spack is designed to install lots of different versions of things, as others have said. This is a snapshot of the syntax: you can install HDF5 at lots of different versions, you can inject flags into the build, you can pick a compiler, and you can do all of that on the fly. It will build you a custom version of that software and let you use it, and you can get it into your environment in a lot of different ways. What we're trying to do with Spack is provide the ease of use of the mainstream tools people are used to, but with the flexibility HPC needs. Whether we fully accomplish that is a whole other question, because there's still a lot of complexity here, precisely because it's intended for HPC.

Originally Spack was designed to build from source, because it was trying to automate people's common workflows. The Fermilab and CERN folks added a first implementation of binary packaging to Spack, and I talked about some of that at a past FOSDEM. Since then we've started relying on the build caches a lot more. Spack has relocatable build caches that you can populate from a build farm or straight out of your own Spack build. You may not want to do it that second way, because then you won't have padding on the paths, and as was said earlier, patchelf-based relocation is dangerous. When we build binaries for wide use, we generally pad the paths pretty extensively so that we can just poke new values in, instead of having to do all the patchelf work. Anyway, you can install Spack binaries from a build cache in S3 into your home directory, you can make a common build cache on a shared file system, and you can use a build cache to accelerate CI. It's very handy because it eliminates the need to rebuild lots of stuff all the time.
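Concretely, the spec syntax and build-cache workflow he's describing look roughly like this on the command line. A sketch, not from the talk; the versions, paths, and mirror name are illustrative:

```bash
# Build custom configurations from source: versions, compilers,
# flags, targets, and dependencies are all chosen on the fly.
spack install hdf5@1.12.2
spack install hdf5@1.12.2 %gcc@12.2.0 cflags="-O3"
spack install hdf5 +mpi ^mpich target=skylake_avx512

# Make a common build cache on a shared file system out of what you built.
# (Older Spack releases spell this `spack buildcache create`.)
spack mirror add local file:///shared/spack-cache
spack buildcache push --unsigned local hdf5
```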
If you look at the Spack project as a whole, I think people know most of this: there's a community, we maintain the core tool, and there are the package recipes. The part you don't see is all the infrastructure behind the scenes that keeps the thing working. Originally we did not have CI for Spack, or at least not for the package builds. We've always had CI for the tool itself, with unit tests and checks on concretization and so on, but we weren't building all the packages. We're still not building all the packages, but we're building quite a few of them.

With the infrastructure we have, you can build lots of software stacks on top of Spack. You write a YAML description of what you want in the software stack. There's E4S, the AWS stack, Livermore's math stack, the viz SDK within the Exascale Computing Project. Every application is its own software stack these days: our production codes have upwards of 100 dependencies that they use for multi-physics, and each of them is essentially maintaining its own little private software distribution in some sense or another. We'd like to be able to build all of this and ensure that it keeps working.

That's hard to do, given that the GitHub repo for Spack is a pretty busy place. There are almost 7,000 packages in Spack now, and over the whole life of the project there have been over 1,100 contributors. This month there were 122 people active on the GitHub repo, over 400 commits, and 300 to 500 PRs per month that we have to merge. Ensuring that everything stays working across that many changes is pretty hard, and you'd be nuts to do it without CI.

One of the problems we have, though, is that CI for HPC is hard. If you want to test in the HPC environments you actually care about, you can't just take an HPC node and hook it up to random pull requests on GitHub. The centers don't like that when the machine might have export-controlled software on it, because you're effectively allowing some random person in a pull request to run code on your HPC machine.

This is the model for Spack: a bunch of external contributors on GitHub constantly contribute to the develop branch, and we have stable release branches, where we freeze the packages to reduce the churn, for the people who rely on that. Most users are actually on develop, at least according to our surveys, which is a little surprising to me, but that's where we are.
Off of the release branches, there's a software distribution within the Exascale Computing Project (ECP) in the U.S. called E4S, and there are a few others, which freeze a commit from Spack and do their own integration after that. That's really supposed to be the deployment mechanism for the 100 or so packages in ECP, around 600 with dependencies. E4S gets deployed to the facilities, and the E4S team goes and ensures that everything works. But we're not able to run on those systems on pull requests in CI, and it's very frustrating: essentially this is a bunch of downstream work that we would really like to get rid of. Moreover, the applications are doing downstream integration too. They may have their own CI, which may be good, and they're pulling from all of these places: a facility deployment, develop, a release. They're integrating from everywhere, so there's a lot of downstream porting.

What we would ultimately like to do in Spack is take all of that downstream work and move it upstream, so that we're doing CI testing on develop along with everything else. This is progress toward that, but we're not doing it yet. The main obstacle to building things that look like the HPC environments right now is a licensing issue: we can't take the Cray PE container and run it in the cloud, because that's just not something HPE's license allows. We are pushing them real hard on this, trying to get an exception that lets us build things. Then we could do the work upstream and, ideally, deploy at the facilities from the binary cache, which I think would be way more stable and less error-prone than what we do right now.

We set out to build this CI system with a bunch of different goals. One is to be sustainable: we don't want to change the maintainer workflow. We already have few enough maintainers for the amount of work there is. They're used to going out on GitHub, approving PRs, checking that things build, and getting them merged. We don't want them to have to both maintain PRs and think about how a separate integration branch is doing, like some distros do; we'd like that to just happen. In that vein, we want a rolling release: on develop we're constantly building binaries, and we basically snapshot develop for every release and say: it's stable, everything built, we're ready to cut a release.
Then we backport bug fixes onto that release branch if we need to. We want this to eventually support all 6,900 packages; it's not something we're doing now. We want source builds to still work alongside these binaries once it's done, and we want to make sure the recipes stay versatile enough to do all those combinations of builds I showed on the first slide. Finally, and this is a big one, we want to ensure that the binaries in Spack are just as trustworthy as the sources. If you feel you can trust our maintainers and rely on the checksummed sources in Spack packages, you should feel just as comfortable with the binaries we're putting in the build cache for you.

Think about how this works in traditional package managers, say APT or YUM. You have a recipe per package configuration that gets thrown into a build farm; for each package configuration (think easyconfigs, if EasyBuild had binaries) you get portable, unoptimized binaries, one per package configuration or spec file. You're managing one software stack that's meant to be upgraded over time, with a consistent ABI across the distribution so you can swap one package in for another, and the solver in those distributions really operates on the binaries. In Spack, you have parameterized package recipes that we design to be portable, and we want the maintainers to work on them together so they remain portable across environments. You throw those into the build farm, effectively test the parameterized recipe in lots of different configurations, and spit out lots of different stacks, optimized for different systems, OSes, compilers, MPIs, and so on, all from the same portable recipes. We also want you to be able to choose, at any time, to build something from source along with that, if you want to customize some aspect of the stack.
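One way to picture "one set of recipes, many optimized stacks" is a Spack environment that expands the same package list across a matrix of compilers and targets. This is an illustrative sketch, not one of the actual production stack definitions:

```yaml
# spack.yaml: hypothetical stack; one recipe set, a matrix of configurations
spack:
  definitions:
    - packages: [hdf5, zlib, openmpi]
    - compilers: ['%gcc@12.2.0', '%oneapi@2023.0.0']
    - targets: ['target=x86_64_v3', 'target=a64fx']
  specs:
    - matrix:          # cross product: every package, each compiler, each target
        - [$packages]
        - [$compilers]
        - [$targets]
```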
To enable that, we came up with this architecture. We have a bunch of AWS resources, because AWS has been nice enough to donate some cycles to the project. They're interested in using Spack in their ParallelCluster product, so their motivation is having binaries ready to go when someone spins up a cluster in the cloud. They don't want people to spin up a cluster and have it sit there building software for hours before it runs anything, after having been charged a bunch of money. Which would be nice for them, because they'd make a lot of money, but then no one would use the service. They want binaries.

In there, we use S3 and CloudFront to distribute the binaries around the world. EC2 is really the main build resource, and RDS is in there, but it's not that important. We've got a Kubernetes cluster with autoscaling runners, so we're building mostly in containers inside Kubernetes, and there's a high-availability GitLab instance in there, too. We chose GitLab because the HPC centers actually run GitLab CI themselves: the same CI logic that we run in the cloud could be run internally to generate these pipelines at your own site, too. You could slap another back end on this and have it generate build graphs for some other system, but GitLab is the one we're using. We use runner pools with a tool called Karpenter to get just-in-time instances and place the containers on them efficiently, and we have some bare-metal runners at the University of Oregon with some fairly exotic architectures. So if we specifically need to build or run on something that has an AMD GPU, or A64FX, and so on, we can do that, and we could add more runners to this eventually. And there's a bot that coordinates all this work. So it's a lot of stuff. Every time I look at this, I am amazed at how complicated CI is. It's one of those things that seems like it should just work, but there is a lot to maintaining a reliable service for this many builds. I suspect other distro maintainers realized that long ago, and I'm just late to the game.

The way contributing a stack to Spack works is that we have a directory in the repo with all of the cloud pipelines in it. Some of them are for AWS; some are different variations on E4S. Each of those directories contains just a spack.yaml that defines what's to be built. If you look inside one, it's basically just a list of packages. Here's the ML CUDA one, which builds PyTorch, TensorFlow, Keras, JAX, and friends for CUDA. It's a list of packages, plus there's a target setting up top that applies to all the packages; you could have a matrix of targets if you wanted. Then ROCm is disabled and CUDA enabled on everything except LLVM, because of a bug that's linked there, and I'm not entirely sure about the specifics of that one.
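In outline, that kind of stack definition looks something like the following. This is an abbreviated sketch; the real files live under share/spack/gitlab/cloud_pipelines in the Spack repo and carry more configuration:

```yaml
# spack.yaml: abbreviated sketch of an ML-stack definition
spack:
  packages:
    all:
      require: target=x86_64_v4   # one target setting for every package
      variants: ~rocm +cuda       # disable ROCm, enable CUDA everywhere...
    llvm:
      variants: ~cuda             # ...except LLVM, due to the linked bug
  specs:
    - py-torch
    - py-tensorflow
    - py-keras
    - py-jaxlib
```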
The configuration part is up top, and it's fairly minimal for the stack. If you look at these files today, there's currently a bunch of other boilerplate for things like mapping runners, which I'll get to in a minute, but there's a PR that's going to go in after which this is basically all that will be in your stack, plus maybe some includes from elsewhere. That's essentially a stack definition, and it makes it very easy to change low-level parameters in the stack. For example, we had a working E4S stack with something like six or seven hundred packages building, and we wanted better testing for oneAPI, because that's what they're going to use on Aurora. So we added some compiler config and said everything should use oneAPI. At the very least, we got a pipeline generated for oneAPI, with some errors, and that made it really easy to iterate on this with Intel. We would basically say: okay, this package is broken, here's the bug, go fix it. They'd come back with another version of oneAPI, and we iterated with them until it was done. I think this is probably more open source than anyone has recently run through a vendor compiler, and just being able to do this is big, because it might make those compilers actually viable for real programs that have lots of dependencies. At the moment you have to sort of piece your program together and build parts of it with, I don't know, PGI was the infamous one that broke on everything. I think this could help make the vendor compiler a viable second option, and maybe instill some competition among the vendors, because they can run this frequently and show benchmarks against these packages. So this was, I think, a win. Yeah, thank you.

Each of those stacks gets concretized. As people here know, in Spack you take the abstract description of the things you want to install, which is basically the requirements, and run it through our dependency solver. You get back a concrete description of what you're going to build, which is the whole concrete graph. From that we generate a GitLab CI YAML that describes the jobs that need to run to build the whole thing. This is the part we could swap out for something else. We've looked at Tekton pipelines, we've looked at other options; some people use Jenkins. There are all sorts of systems out there that you could map the jobs onto, and I think we could generate a description like that from the representation we have.
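That concretize-then-generate flow can be run by hand against an environment. A rough sketch, assuming the environment's spack.yaml already has a CI section and a mirror configured; flags vary a bit across Spack versions:

```bash
# from a directory containing the stack's spack.yaml
spack -e . concretize                                  # abstract specs -> concrete graph
spack -e . ci generate --output-file .gitlab-ci.yml    # one GitLab job per package build
```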
For mapping those jobs, we have a section in the CI part of the spack.yaml right now that basically tells Spack how to generate the GitLab piece. There's a mapping section with match blocks: if a spec matches any of the patterns listed, and in the first block that's just a couple of package names, then we put special tags on the job that say, get me a special runner for these things. That first block is basically so that I don't run out of memory building LLVM, TensorFlow, or PyTorch: get me something with a lot of memory and a big CPU to build those, because they have to run on a big instance. Those are the long poles in our tent in CI. Down at the bottom, there's a catch-all mapping: everything else gets a runner that supports x86_64_v4 and is a little smaller than the other one. You could do this for lots of different architecture combinations, and you can ask for images and things like that.
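A condensed sketch of such a mapping, in the gitlab-ci schema Spack used around the time of this talk (newer Spack versions restructure this under a ci: section; the tag names here are made up):

```yaml
# inside spack.yaml: hypothetical runner mapping for generated jobs
spack:
  gitlab-ci:
    mappings:
      - match: [llvm, py-tensorflow, py-torch]
        runner-attributes:
          tags: [x86_64_v4, huge]    # big-memory instances for the long poles
      - match: ['target=x86_64_v4']  # everything else
        runner-attributes:
          tags: [x86_64_v4, small]
```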
I said that we needed to ensure that the binaries are as reliable as the sources. So we sat down and asked ourselves: what is it that people trust about the Spack project? And it's really the maintainers. If you use any open source project, you're trusting the maintainers, or you really shouldn't be using that project, and I don't see where we can do better than that. The place where bad things could get into a build, at least from Spack, is the build environment. If you give people control of the PR environment where they're submitting things, they could push a commit that puts something into a binary that gets cached, and if we took that binary and stuck it out there for anyone to use, there could be bad things in it. So we have a separate set of untrusted S3 buckets where we only do PR builds; each PR gets its own build cache. That enables the maintainers to see if things work, and then they come along and review the code. Once things are actually merged to develop, we don't trust any of the binaries that we built on PRs: we go and rebuild everything and sign it, specifically from the sources that got approved, so that we know we didn't cache anything from that environment. That's where the develop and release caches come from, and they're entirely separate from the PR environment.

The signatures are ephemeral. There's a signing key locked up in a secret server, we have subkeys, and we generate ephemeral keys for the signing in the pipelines, so whatever key your binary got signed with doesn't actually exist anymore by the time you consume it. We could look at Sigstore for this; it wasn't quite ready for arbitrary binary signing when we did this, but that's an option to reduce some of the custom GPG stuff we had to do here.

The pull request integration, I think, makes this easy for at least most of the contributors: they get status updates on PRs. And it's fairly easy for users, because they can just add one of these binary mirrors and start using the build cache. I'm not going to get into the details here, but for a very long time it was easy to get a lot of cache misses in Spack, because we would just look up hashes. I have another presentation about our reusing concretizer; the summary is that if you add one of these build caches and the binaries are available, Spack will prefer to use them before it tries to rebuild something. With the reusing concretizer, this is actually quite powerful.
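From the user side, that amounts to a couple of commands. A sketch; the E4S cache is one public mirror, and the package name is just an example:

```bash
spack mirror add e4s https://cache.e4s.io   # add a public binary mirror
spack buildcache keys --install --trust     # import and trust the cache's signing keys
spack install py-torch                      # the reusing concretizer prefers cached binaries
```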
So, yeah, what could go wrong? Well, there is a burden to doing this. A build-cache distribution like Spack's, or Nix's, or Guix's, is different from an RPM distribution, because every node in the graph has a hash, and the deployment model is really that you have to deploy with what you built with. You can't just swap a new version of zlib into a stack: if something has a particular hash, that implies all of its dependencies' hashes, so you need to deploy the build cache with everything it was built with. If, for example, you modify xz, then you're going to need to rebuild all the things that depend on it, and you're going to need to do that all the way up to the roots of your environment every once in a while, so that there's a consistent build cache for people to deploy. And that can be bad if your stack is as big as E4S. Someone comes in and submits a PR, which you can do, by the way, that modifies some low-level package, and all of a sudden this is what happens to your CI system: your whole graph is rebuilding again. It can take a long time for develop to catch up with a change like that. And right now we are rebuilding all that stuff on PRs, so your pipelines can get long. You dig in there and see that VisIt is still building, and you think: this is the fifth time I've built VisIt today. I think Harmen once commented that he was worried Spack would eventually cause the heat death of the universe because of ParaView builds. Or no, that the ParaView builds would eventually bring on climate change. So we worry about that; we don't want to do that all the time.

The other thing that can happen is that there's a delicate balance between redundant builds and holding back PRs. I didn't think about this before we really got into CI, but it matters which commit you pick to merge with when you're doing a build-cache build. Say you have a pipeline where you've built commit B, and develop has now picked up on D, which is building. PR 1 comes in based on B; you can merge it with B, get a lot of reuse, and get pretty good testing on PR 1. If instead you get a PR that's based beyond your last completed develop build, and you merge it with D (it's already based on C, so you can't really merge with C), you're going to duplicate the work that's already being done on develop. If you get a bunch of PRs like that at the same time, you get a whole bunch of builds that are effectively already being done on develop. This is a difficulty of navigating these PR-based CI systems; if you had a server that shared each built patch so it was only ever built once, you could get around this. So you have to be picky: hold up PR 2 until the next develop build is done, then merge with that commit and send it to GitLab to be built. And that can annoy contributors, because they have to wait for that to happen to their PR in order to keep the CI system sane. We actually did bring GitLab down once with a bunch of PRs like this: something got broken on develop, develop got held up, people started submitting a bunch of PRs, they were all doing redundant builds, and GitLab fell over. So that was fun.

CI does keep things stable. At least anecdotally, our package maintainers at the lab are much happier with how reliably their packages build on the machines since we've had CI. But like I said, the committers get frustrated. The other thing that happens when you do this many builds on PRs is that if your CI system has occasional system errors, and you're building a thousand things in a PR pipeline, it's very likely that you're going to hit a system error somewhere. So you end up having to babysit PRs a bit, and that can be painful.

The other problem is that it's hard to stay correct. Testing on PRs doesn't really ensure that you have a working develop branch. Say you have a setup like this, with an initial package state: you get one pull request that updates B, and you get another pull request that updates C.
You test both of those configurations on your PRs and they work, and you merge them. But the thing you now have on develop is updated B together with updated C, and you never tested that. Keeping that state consistent is rather difficult. Before we had CI, I think we just didn't see these kinds of issues; they would just manifest on users, which is not great. Now we run into them in CI, because we can see that things are broken on develop. So we're looking into merge queues, which actually solve this problem, and a couple of others we have, pretty effectively. You can iterate faster on PRs with merge queues, because you're merging in sequence and testing in parallel; I'll describe what that looks like in a minute. It's a good balance of CI thoroughness versus responsiveness, because you can do sparse tests on the PRs, queue them, and then do the heavy tests. And it actually preserves the security model, because anything queued in a merge queue has been approved by maintainers, so you can take those builds and move them straight into develop. What that looks like is this: you have the same initial package state, you get two pull requests, you do some small testing on each pull request, and then you set up a merge queue where you do heavy testing on things staged exactly as they will be merged if they're successful. One gets committed, then the next gets committed, and now you've tested the final configuration of develop, and you're never in an inconsistent state. So we're going to stage the work we do in CI. On PRs, we'll probably build just the package, or just the package and its dependents, which is similar to what Nix does. On most merge-queue pipelines, we may build a bit more than that, and then every once in a while we'll build everything on develop. We'll see how it goes; we can probe what the balance is. So that's where we're at. Thanks.

Host: Okay, I think we have time for one or two questions. Any questions for Todd?

Audience: A bit of an off-the-wall question. We have, for example, a software-bill-of-materials dev room here. You mentioned export-controlled software and also being able to trust binaries. I work with classified customers who have isolated networks; I'd probably be shot by MI6 if I told you who they were. They're now being asked what software is running on those systems, whatever that question really means. Could Spack help with producing a report of exactly what software that is?
Todd: Yeah, we have a PR open right now so that every Spack build would produce an SBOM in some standard format. There's a whole dev room on SBOMs today, which gets into that. We know everything in the graph, and so do Nix and Guix and the other systems that work this way. We don't expose it in a standard format that auditing systems can scan right now, but that's what we'd like to do.

Audience: Very briefly: Debian, a while ago, did work on reproducible builds, which is much more difficult. If you haven't worked with that a bit, it might be interesting for you.

Todd: Yeah, we would like to have fully reproducible builds. It's a lot of upstream patching, right? Even Debian isn't fully reproducible right now. I think that's something we could consider after we get down to controlling everything, even libc, because at the moment we have to run on things like Crays, where so much depends on the module environment that we have to include the external environment to get some of these builds done. But yeah, I would like to have a much more isolated build environment. It's good practice.

Host: Okay, one more question here, and then we need to switch.

Audience: Hi. You were talking about padding the paths in your binaries to allow relocation. Given that you don't have a static path or a predefined destination, as with FHS-type locations, are you in serious danger of running out of space in that header?

Todd: Well, we're not building into a static path; we might be building into a home directory. You can put padding in your install tree prefix, which is like the Nix store, and say: build with 256-character paths. You wouldn't want a user to actually deploy into a path like that, but you can build that way, create the binary, and then redeploy into a short path.

Audience: So you've potentially got space for an arbitrary-length path at run time.

Todd: A lot of stuff doesn't build with overly long paths. If you get to 512, Autotools starts breaking down and not supporting paths that long, and the packages actually don't support it. So the sweet spot seems to be around 256.

Host: Okay, thanks.
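For reference, the padding Todd describes is exposed as a Spack config option. A minimal sketch (the root path is illustrative; 256 matches the "sweet spot" he mentions):

```yaml
# config.yaml: pad install prefixes so relocation can just poke
# shorter paths into the padded space instead of rewriting binaries
config:
  install_tree:
    root: /opt/spack
    padded_length: 256
```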