Host: Our next speaker is Todd Gamblin. As a lot of people here know, I'm very much involved in the EasyBuild project, which was actually the excuse we used to start the HPC dev room. But we're also very open to other projects that are very similar to what we work on every day, so in some sense Spack is our mortal enemy, but we do allow them to give talks in the dev room as well.

Todd Gamblin: Yeah, with that, thanks. Okay, who's heard of Spack? Okay, cool, people have heard of Spack, so we don't need to do too many introductions for this talk. This is less a talk about Spack and more a talk about the CI that we've started doing since introducing binary packages in Spack. I don't think I need to tell people why they need Spack for HPC; lots of folks have talked about that already today.

Harmen mentioned deployment, so I'm supposed to talk a little bit about that. To deploy Spack, if you want to try it on a new system, just clone it from the Git repo and run it. All you need is Python and a few other tools on your system, so you can run it straight out of the repo if you want to play around with it and build things right there.

Spack is designed to install lots of different versions of things, as others have said. This is a snapshot of the syntax: you can install HDF5 at lots of different versions, you can inject flags into the build, you can pick a compiler, and you can do all of that on the fly. It will build you a custom version of that software and let you use it, and you can get it into your environment in a lot of different ways. What we're trying to do with Spack is provide the ease of use of the mainstream tools people are used to, but with the flexibility HPC needs. Whether we fully accomplish that is a whole other question, because there's still a lot of complexity here, precisely because it's intended for HPC.

Originally Spack was designed to build from source, because it was trying to automate people's common workflows. The Fermilab and CERN folks added a first implementation of binary packaging to Spack, and I talked about some of that at a past FOSDEM. Since then we've started relying on the build caches a lot more. Spack has relocatable build caches that you can populate from a build farm or straight out of your own Spack build. You may not want to do it that second way, because then you won't have padding on the paths, and as was said earlier, patchelf-based relocation is dangerous. When we build binaries for wide use, we generally pad the paths pretty extensively so that we can just poke new values in, instead of having to do all the patchelf work. Anyway, you can install Spack binaries from a build cache in S3 into your home directory, you can make a common build cache on a shared file system, and you can use a build cache to accelerate CI. It's very handy because it eliminates the need to rebuild lots of stuff all the time.
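Concretely, the spec syntax and build-cache workflow he's describing look roughly like this on the command line. A sketch, not from the talk; the versions, paths, and mirror name are illustrative:

```bash
# Build custom configurations from source: versions, compilers,
# flags, targets, and dependencies are all chosen on the fly.
spack install hdf5@1.12.2
spack install hdf5@1.12.2 %gcc@12.2.0 cflags="-O3"
spack install hdf5 +mpi ^mpich target=skylake_avx512

# Make a common build cache on a shared file system out of what you built.
# (Older Spack releases spell this `spack buildcache create`.)
spack mirror add local file:///shared/spack-cache
spack buildcache push --unsigned local hdf5
```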
If you look at the Spack project as a whole, I think people know most of this: there's a community, we maintain the core tool, and there are the package recipes. The part you don't see is all the infrastructure behind the scenes that keeps the thing working. Originally we did not have CI for Spack, or at least not for the package builds. We've always had CI for the tool itself, with unit tests and checks on concretization and so on, but we weren't building all the packages. We're still not building all the packages, but we're building quite a few of them.

With the infrastructure we have, you can build lots of software stacks on top of Spack. You write a YAML description of what you want in the software stack. There's E4S, the AWS stack, Livermore's math stack, the viz SDK within the Exascale Computing Project. Every application is its own software stack these days: our production codes have upwards of 100 dependencies that they use for multi-physics, and each of them is essentially maintaining its own little private software distribution in some sense or another. We'd like to be able to build all of this and ensure that it keeps working.

That's hard to do, given that the GitHub repo for Spack is a pretty busy place. There are almost 7,000 packages in Spack now, and over the whole life of the project there have been over 1,100 contributors. This month there were 122 people active on the GitHub repo, over 400 commits, and 300 to 500 PRs per month that we have to merge. Ensuring that everything stays working across that many changes is pretty hard, and you'd be nuts to do it without CI.

One of the problems we have, though, is that CI for HPC is hard. If you want to test in the HPC environments you actually care about, you can't just take an HPC node and hook it up to random pull requests on GitHub. The centers don't like that when the machine might have export-controlled software on it, because you're effectively allowing some random person in a pull request to run code on your HPC machine.

This is the model for Spack: a bunch of external contributors on GitHub constantly contribute to the develop branch, and we have stable release branches, where we freeze the packages to reduce the churn, for the people who rely on that. Most users are actually on develop, at least according to our surveys, which is a little surprising to me, but that's where we are.
Off of the release branches, there's a software distribution within the Exascale Computing Project (ECP) in the U.S. called E4S, and there are a few others, which freeze a commit from Spack and do their own integration after that. That's really supposed to be the deployment mechanism for the 100 or so packages in ECP, around 600 with dependencies. E4S gets deployed to the facilities, and the E4S team goes and ensures that everything works. But we're not able to run on those systems on pull requests in CI, and it's very frustrating: essentially this is a bunch of downstream work that we would really like to get rid of. Moreover, the applications are doing downstream integration too. They may have their own CI, which may be good, and they're pulling from all of these places: a facility deployment, develop, a release. They're integrating from everywhere, so there's a lot of downstream porting.

What we would ultimately like to do in Spack is take all of that downstream work and move it upstream, so that we're doing CI testing on develop along with everything else. This is progress toward that, but we're not doing it yet. The main obstacle to building things that look like the HPC environments right now is a licensing issue: we can't take the Cray PE container and run it in the cloud, because that's just not something HPE's license allows. We are pushing them real hard on this, trying to get an exception that lets us build things. Then we could do the work upstream and, ideally, deploy at the facilities from the binary cache, which I think would be way more stable and less error-prone than what we do right now.

We set out to build this CI system with a bunch of different goals. One is to be sustainable: we don't want to change the maintainer workflow. We already have few enough maintainers for the amount of work there is. They're used to going out on GitHub, approving PRs, checking that things build, and getting them merged. We don't want them to have to both maintain PRs and think about how a separate integration branch is doing, like some distros do; we'd like that to just happen. In that vein, we want a rolling release: on develop we're constantly building binaries, and we basically snapshot develop for every release and say: it's stable, everything built, we're ready to cut a release.
Then we backport bug fixes onto that release branch if we need to. We want this to eventually support all 6,900 packages; it's not something we're doing now. We want source builds to still work alongside these binaries once it's done, and we want to make sure the recipes stay versatile enough to do all those combinations of builds I showed on the first slide. Finally, and this is a big one, we want to ensure that the binaries in Spack are just as trustworthy as the sources. If you feel you can trust our maintainers and rely on the checksummed sources in Spack packages, you should feel just as comfortable with the binaries we're putting in the build cache for you.

Think about how this works in traditional package managers, say APT or YUM. You have a recipe per package configuration that gets thrown into a build farm; for each package configuration (think easyconfigs, if EasyBuild had binaries) you get portable, unoptimized binaries, one per package configuration or spec file. You're managing one software stack that's meant to be upgraded over time, with a consistent ABI across the distribution so you can swap one package in for another, and the solver in those distributions really operates on the binaries. In Spack, you have parameterized package recipes that we design to be portable, and we want the maintainers to work on them together so they remain portable across environments. You throw those into the build farm, effectively test the parameterized recipe in lots of different configurations, and spit out lots of different stacks, optimized for different systems, OSes, compilers, MPIs, and so on, all from the same portable recipes. We also want you to be able to choose, at any time, to build something from source along with that, if you want to customize some aspect of the stack.
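One way to picture "one set of recipes, many optimized stacks" is a Spack environment that expands the same package list across a matrix of compilers and targets. This is an illustrative sketch, not one of the actual production stack definitions:

```yaml
# spack.yaml: hypothetical stack; one recipe set, a matrix of configurations
spack:
  definitions:
    - packages: [hdf5, zlib, openmpi]
    - compilers: ['%gcc@12.2.0', '%oneapi@2023.0.0']
    - targets: ['target=x86_64_v3', 'target=a64fx']
  specs:
    - matrix:          # cross product: every package, each compiler, each target
        - [$packages]
        - [$compilers]
        - [$targets]
```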
To enable that, we came up with this architecture. We have a bunch of AWS resources, because AWS has been nice enough to donate some cycles to the project. They're interested in using Spack in their ParallelCluster product, so their motivation is having binaries ready to go when someone spins up a cluster in the cloud. They don't want people to spin up a cluster and have it sit there building software for hours before it runs anything, after having been charged a bunch of money. Which would be nice for them, because they'd make a lot of money, but then no one would use the service. They want binaries.

In there, we use S3 and CloudFront to distribute the binaries around the world. EC2 is really the main build resource, and RDS is in there, but it's not that important. We've got a Kubernetes cluster with autoscaling runners, so we're building mostly in containers inside Kubernetes, and there's a high-availability GitLab instance in there, too. We chose GitLab because the HPC centers actually run GitLab CI themselves: the same CI logic that we run in the cloud could be run internally to generate these pipelines at your own site, too. You could slap another back end on this and have it generate build graphs for some other system, but GitLab is the one we're using. We use runner pools with a tool called Karpenter to get just-in-time instances and place the containers on them efficiently, and we have some bare-metal runners at the University of Oregon with some fairly exotic architectures. So if we specifically need to build or run on something that has an AMD GPU, or A64FX, and so on, we can do that, and we could add more runners to this eventually. And there's a bot that coordinates all this work. So it's a lot of stuff. Every time I look at this, I am amazed at how complicated CI is. It's one of those things that seems like it should just work, but there is a lot to maintaining a reliable service for this many builds. I suspect other distro maintainers realized that long ago, and I'm just late to the game.

The way contributing a stack to Spack works is that we have a directory in the repo with all of the cloud pipelines in it. Some of them are for AWS; some are different variations on E4S. Each of those directories contains just a spack.yaml that defines what's to be built. If you look inside one, it's basically just a list of packages. Here's the ML CUDA one, which builds PyTorch, TensorFlow, Keras, JAX, and friends for CUDA. It's a list of packages, plus there's a target setting up top that applies to all the packages; you could have a matrix of targets if you wanted. Then ROCm is disabled and CUDA enabled on everything except LLVM, because of a bug that's linked there, and I'm not entirely sure about the specifics of that one.
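In outline, that kind of stack definition looks something like the following. This is an abbreviated sketch; the real files live under share/spack/gitlab/cloud_pipelines in the Spack repo and carry more configuration:

```yaml
# spack.yaml: abbreviated sketch of an ML-stack definition
spack:
  packages:
    all:
      require: target=x86_64_v4   # one target setting for every package
      variants: ~rocm +cuda       # disable ROCm, enable CUDA everywhere...
    llvm:
      variants: ~cuda             # ...except LLVM, due to the linked bug
  specs:
    - py-torch
    - py-tensorflow
    - py-keras
    - py-jaxlib
```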
The configuration part is up top, and it's fairly minimal for the stack. If you look at these files today, there's currently a bunch of other boilerplate for things like mapping runners, which I'll get to in a minute, but there's a PR that's going to go in after which this is basically all that will be in your stack, plus maybe some includes from elsewhere. That's essentially a stack definition, and it makes it very easy to change low-level parameters in the stack. For example, we had a working E4S stack with something like six or seven hundred packages building, and we wanted better testing for oneAPI, because that's what they're going to use on Aurora. So we added some compiler config and said everything should use oneAPI. At the very least, we got a pipeline generated for oneAPI, with some errors, and that made it really easy to iterate on this with Intel. We would basically say: okay, this package is broken, here's the bug, go fix it. They'd come back with another version of oneAPI, and we iterated with them until it was done. I think this is probably more open source than anyone has recently run through a vendor compiler, and just being able to do this is big, because it might make those compilers actually viable for real programs that have lots of dependencies. At the moment you have to sort of piece your program together and build parts of it with, I don't know, PGI was the infamous one that broke on everything. I think this could help make the vendor compiler a viable second option, and maybe instill some competition among the vendors, because they can run this frequently and show benchmarks against these packages. So this was, I think, a win. Yeah, thank you.

Each of those stacks gets concretized. As people here know, in Spack you take the abstract description of the things you want to install, which is basically the requirements, and run it through our dependency solver. You get back a concrete description of what you're going to build, which is the whole concrete graph. From that we generate a GitLab CI YAML that describes the jobs that need to run to build the whole thing. This is the part we could swap out for something else. We've looked at Tekton pipelines, we've looked at other options; some people use Jenkins. There are all sorts of systems out there that you could map the jobs onto, and I think we could generate a description like that from the representation we have.
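That concretize-then-generate flow can be run by hand against an environment. A rough sketch, assuming the environment's spack.yaml already has a CI section and a mirror configured; flags vary a bit across Spack versions:

```bash
# from a directory containing the stack's spack.yaml
spack -e . concretize                                  # abstract specs -> concrete graph
spack -e . ci generate --output-file .gitlab-ci.yml    # one GitLab job per package build
```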
For mapping those jobs, we have a section in the CI part of the spack.yaml right now that basically tells Spack how to generate the GitLab piece. There's a mapping section with match blocks: if a spec matches any of the patterns listed, and in the first block that's just a couple of package names, then we put special tags on the job that say, get me a special runner for these things. That first block is basically so that I don't run out of memory building LLVM, TensorFlow, or PyTorch: get me something with a lot of memory and a big CPU to build those, because they have to run on a big instance. Those are the long poles in our tent in CI. Down at the bottom, there's a catch-all mapping: everything else gets a runner that supports x86_64_v4 and is a little smaller than the other one. You could do this for lots of different architecture combinations, and you can ask for images and things like that.
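A condensed sketch of such a mapping, in the gitlab-ci schema Spack used around the time of this talk (newer Spack versions restructure this under a ci: section; the tag names here are made up):

```yaml
# inside spack.yaml: hypothetical runner mapping for generated jobs
spack:
  gitlab-ci:
    mappings:
      - match: [llvm, py-tensorflow, py-torch]
        runner-attributes:
          tags: [x86_64_v4, huge]    # big-memory instances for the long poles
      - match: ['target=x86_64_v4']  # everything else
        runner-attributes:
          tags: [x86_64_v4, small]
```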
I said that we needed to ensure that the binaries are as reliable as the sources. So we sat down and asked ourselves: what is it that people trust about the Spack project? And it's really the maintainers. If you use any open source project, you're trusting the maintainers, or you really shouldn't be using that project, and I don't see where we can do better than that. The place where bad things could get into a build, at least from Spack, is the build environment. If you give people control of the PR environment where they're submitting things, they could push a commit that puts something into a binary that gets cached, and if we took that binary and stuck it out there for anyone to use, there could be bad things in it. So we have a separate set of untrusted S3 buckets where we only do PR builds; each PR gets its own build cache. That enables the maintainers to see if things work, and then they come along and review the code. Once things are actually merged to develop, we don't trust any of the binaries that we built on PRs: we go and rebuild everything and sign it, specifically from the sources that got approved, so that we know we didn't cache anything from that environment. That's where the develop and release caches come from, and they're entirely separate from the PR environment.

The signatures are ephemeral. There's a signing key locked up in a secret server, we have subkeys, and we generate ephemeral keys for the signing in the pipelines, so whatever key your binary got signed with doesn't actually exist anymore by the time you consume it. We could look at Sigstore for this; it wasn't quite ready for arbitrary binary signing when we did this, but that's an option to reduce some of the custom GPG stuff we had to do here.

The pull request integration, I think, makes this easy for at least most of the contributors: they get status updates on PRs. And it's fairly easy for users, because they can just add one of these binary mirrors and start using the build cache. I'm not going to get into the details here, but for a very long time it was easy to get a lot of cache misses in Spack, because we would just look up hashes. I have another presentation about our reusing concretizer; the summary is that if you add one of these build caches and the binaries are available, Spack will prefer to use them before it tries to rebuild something. With the reusing concretizer, this is actually quite powerful.
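From the user side, that amounts to a couple of commands. A sketch; the E4S cache is one public mirror, and the package name is just an example:

```bash
spack mirror add e4s https://cache.e4s.io   # add a public binary mirror
spack buildcache keys --install --trust     # import and trust the cache's signing keys
spack install py-torch                      # the reusing concretizer prefers cached binaries
```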
So, yeah, what could go wrong? Well, there is a burden to doing this. A build-cache distribution like Spack's, or Nix's, or Guix's, is different from an RPM distribution, because every node in the graph has a hash, and the deployment model is really that you have to deploy with what you built with. You can't just swap a new version of zlib into a stack: if something has a particular hash, that implies all of its dependencies' hashes, so you need to deploy the build cache with everything it was built with. If, for example, you modify xz, then you're going to need to rebuild all the things that depend on it, and you're going to need to do that all the way up to the roots of your environment every once in a while, so that there's a consistent build cache for people to deploy. And that can be bad if your stack is as big as E4S. Someone comes in and submits a PR, which you can do, by the way, that modifies some low-level package, and all of a sudden this is what happens to your CI system: your whole graph is rebuilding again. It can take a long time for develop to catch up with a change like that. And right now we are rebuilding all that stuff on PRs, so your pipelines can get long. You dig in there and see that VisIt is still building, and you think: this is the fifth time I've built VisIt today. I think Harmen once commented that he was worried Spack would eventually cause the heat death of the universe because of ParaView builds. Or no, that the ParaView builds would eventually bring on climate change. So we worry about that; we don't want to do that all the time.

The other thing that can happen is that there's a delicate balance between redundant builds and holding back PRs. I didn't think about this before we really got into CI, but it matters which commit you pick to merge with when you're doing a build-cache build. Say you have a pipeline where you've built commit B, and develop has now picked up on D, which is building. PR 1 comes in based on B; you can merge it with B, get a lot of reuse, and get pretty good testing on PR 1. If instead you get a PR that's based beyond your last completed develop build, and you merge it with D (it's already based on C, so you can't really merge with C), you're going to duplicate the work that's already being done on develop. If you get a bunch of PRs like that at the same time, you get a whole bunch of builds that are effectively already being done on develop. This is a difficulty of navigating these PR-based CI systems; if you had a server that shared each built patch so it was only ever built once, you could get around this. So you have to be picky: hold up PR 2 until the next develop build is done, then merge with that commit and send it to GitLab to be built. And that can annoy contributors, because they have to wait for that to happen to their PR in order to keep the CI system sane. We actually did bring GitLab down once with a bunch of PRs like this: something got broken on develop, develop got held up, people started submitting a bunch of PRs, they were all doing redundant builds, and GitLab fell over. So that was fun.

CI does keep things stable. At least anecdotally, our package maintainers at the lab are much happier with how reliably their packages build on the machines since we've had CI. But like I said, the committers get frustrated. The other thing that happens when you do this many builds on PRs is that if your CI system has occasional system errors, and you're building a thousand things in a PR pipeline, it's very likely that you're going to hit a system error somewhere. So you end up having to babysit PRs a bit, and that can be painful.

The other problem is that it's hard to stay correct. Testing on PRs doesn't really ensure that you have a working develop branch. Say you have a setup like this, with an initial package state: you get one pull request that updates B, and you get another pull request that updates C.
You test both of those configurations on your PRs and they work, and you merge them. But the thing you now have on develop is updated B together with updated C, and you never tested that. Keeping that state consistent is rather difficult. Before we had CI, I think we just didn't see these kinds of issues; they would just manifest on users, which is not great. Now we run into them in CI, because we can see that things are broken on develop. So we're looking into merge queues, which actually solve this problem, and a couple of others we have, pretty effectively. You can iterate faster on PRs with merge queues, because you're merging in sequence and testing in parallel; I'll describe what that looks like in a minute. It's a good balance of CI thoroughness versus responsiveness, because you can do sparse tests on the PRs, queue them, and then do the heavy tests. And it actually preserves the security model, because anything queued in a merge queue has been approved by maintainers, so you can take those builds and move them straight into develop. What that looks like is this: you have the same initial package state, you get two pull requests, you do some small testing on each pull request, and then you set up a merge queue where you do heavy testing on things staged exactly as they will be merged if they're successful. One gets committed, then the next gets committed, and now you've tested the final configuration of develop, and you're never in an inconsistent state. So we're going to stage the work we do in CI. On PRs, we'll probably build just the package, or just the package and its dependents, which is similar to what Nix does. On most merge-queue pipelines, we may build a bit more than that, and then every once in a while we'll build everything on develop. We'll see how it goes; we can probe what the balance is. So that's where we're at. Thanks.

Host: Okay, I think we have time for one or two questions. Any questions for Todd?

Audience: A bit of an off-the-wall question. We have, for example, a software-bill-of-materials dev room here. You mentioned export-controlled software and also being able to trust binaries. I work with classified customers who have isolated networks; I'd probably be shot by MI6 if I told you who they were. They're now being asked what software is running on those systems, whatever that question really means. Could Spack help with producing a report of exactly what software that is?
Todd: Yeah, we have a PR open right now so that every Spack build would produce an SBOM in some standard format. There's a whole dev room on SBOMs today, which gets into that. We know everything in the graph, and so do Nix and Guix and the other systems that work this way. We don't expose it in a standard format that auditing systems can scan right now, but that's what we'd like to do.

Audience: Very briefly: Debian, a while ago, did work on reproducible builds, which is much more difficult. If you haven't worked with that a bit, it might be interesting for you.

Todd: Yeah, we would like to have fully reproducible builds. It's a lot of upstream patching, right? Even Debian isn't fully reproducible right now. I think that's something we could consider after we get down to controlling everything, even libc, because at the moment we have to run on things like Crays, where so much depends on the module environment that we have to include the external environment to get some of these builds done. But yeah, I would like to have a much more isolated build environment. It's good practice.

Host: Okay, one more question here, and then we need to switch.

Audience: Hi. You were talking about padding the paths in your binaries to allow relocation. Given that you don't have a static path or a predefined destination, as with FHS-type locations, are you in serious danger of running out of space in that header?

Todd: Well, we're not building into a static path; we might be building into a home directory. You can put padding in your install tree prefix, which is like the Nix store, and say: build with 256-character paths. You wouldn't want a user to actually deploy into a path like that, but you can build that way, create the binary, and then redeploy into a short path.

Audience: So you've potentially got space for an arbitrary-length path at run time.

Todd: A lot of stuff doesn't build with overly long paths. If you get to 512, Autotools starts breaking down and not supporting paths that long, and the packages actually don't support it. So the sweet spot seems to be around 256.

Host: Okay, thanks.
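For reference, the padding Todd describes is exposed as a Spack config option. A minimal sketch (the root path is illustrative; 256 matches the "sweet spot" he mentions):

```yaml
# config.yaml: pad install prefixes so relocation can just poke
# shorter paths into the padded space instead of rewriting binaries
config:
  install_tree:
    root: /opt/spack
    padded_length: 256
```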