All right, so I'm Todd Gamblin, and I'm from Lawrence Livermore National Laboratory. Normally I would give an intro to what Livermore is, but who's been hearing about Livermore in the news lately? If you heard about the fusion ignition over in the US, that's our lab. So I'm from there.

I work in the HPC area at Livermore, where we have a big supercomputing center. The HPC ecosystem is a pretty complex place. People distribute software mostly as source. You build lots of different variants of each package. Users typically don't have root on the machine when they install software, so they're building from source and installing into their home directory. And you want the code to be optimized for fancy machines like these ones over here. So you're trying to build software that supports a really broad set of environments, including Power, Arm, AMD, and Intel, and also GPU architectures. NVIDIA and now AMD GPUs are showing up, and we've even got a machine coming up at Argonne, near Chicago, with Intel Ponte Vecchio GPUs. On top of all that, the ecosystem has C, C++, Fortran, Python, Lua, and other languages, all linked together in the same app. So we want a distribution that can support this type of environment.

Spack is a package manager that enables software distribution for HPC, given that set of constraints. Packages are not quite like the build specs you would see in a standard RPM- or deb-based distribution. They're really parameterized Python recipes for how to build a package on lots of different architectures, and there's a DSL for doing that, which I'm not going to get into today. But the end user can essentially take one package and install it lots of different ways. You could say: I want to install HDF5 at a particular version; I want to install it with Clang, not GCC; I want the thread-safe option on; or I want to inject some flags into the build and get an entirely different version built with a different set of flags; or one that's targeted at a particular micro-architecture; or one that uses a particular dependency. So you can build the same package with two versions of MPI.
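To make that concrete, here is a rough sketch of the kind of one-off builds Spack's spec syntax expresses on the command line; the version numbers and variant names below are illustrative examples, not something from the talk:

```sh
# Illustrative uses of Spack's spec syntax (versions/variants are examples):
spack install hdf5@1.14.0              # pin a particular version
spack install hdf5 %clang              # build with clang instead of gcc
spack install hdf5 +threadsafe         # flip a build option (a "variant")
spack install hdf5 cflags="-O3"        # inject compiler flags into the build
spack install hdf5 target=skylake      # target a specific micro-architecture
spack install hdf5 ^openmpi            # build against one MPI...
spack install hdf5 ^mvapich2           # ...and the same package against another
```

Each of these produces a separately hashed installation, so all of the variants can coexist side by side.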
So we're trying to provide the ease of use of mainstream tools with the flexibility needed for HPC, so that we can get performance everywhere. Spack builds from source, but you can also install from relocatable build caches, much like you would with, say, Nix or Guix. Their caches aren't relocatable, because they're not really targeting the home-directory use case, but it's the same build-cache model. It's not a typical binary distribution.

The whole project has a fairly large community of contributors, or at least maybe not large by some other distributions' standards, but we have 1,100-plus contributors. We maintain the core tool, and then there's a whole bunch of people who work on package recipes. So in some ways it looks a lot like Homebrew or a project like that. And then there's a whole bunch of infrastructure behind the scenes to keep all this working.

All of these things together enable people to build lots of different software stacks. There's E4S, the extreme-scale software stack maintained by the US Exascale Computing Project. AWS has a stack that they use on their ParallelCluster product, internally and for users. Livermore has its internal software deployment. There are math library stacks, viz tool stacks, things like that. And every application in HPC is really its own software stack. You heard about Flatpaks and Snaps in the last session; making apps more mindful of the fact that their software is actually a distribution is something we've been pushing for a long time within HPC.

The GitHub repo is a pretty busy place. We merge 300 to 500 PRs per month, and something like 411 commits or more. Managing that is kind of painful, and we're trying very hard to reduce downstream work, which is actually difficult for a source-based distribution.

If you think about how Spack is structured, there's a mainline develop branch that most people actually use. They'll just clone it straight from the repo and build from that, kind of like you do with Nix packages. External contributors contribute there. And we cut a release every once in a while, where we stabilize the packages and keep them fixed so that you don't have a lot of version churn in the repo. Then, to actually integrate with the HPC facilities, all the places that are deploying supercomputers, we have this E4S software distribution, where they end up doing a whole bunch of downstream integration at the site: they're basically building the whole thing from source in a new environment. And there's a whole lot of debugging that takes place there that we would really like to be able to move upstream.
The applications, likewise, are not necessarily using what the facility deploys. Some of them do; some of them don't. They pull from basically all of these places. They might get a math solver library from the facility; they might get something else installed from Spack mainline, built the way they want; and they may pull stuff off of release branches too, all to assemble one application and get it built. So there's a lot of porting happening at the bottom of the stack, and what we'd really like to do is move that software integration upstream: get to a point where we build these types of environments in CI all the time, in a sort of rolling release, and do binary deploys on the supercomputers with actually optimized binaries. That's what we're trying to get to.

So we set out to make a binary distribution with a bunch of different goals. The main one, and the one that's pretty key to our whole ecosystem, is that it has to be sustainable. We don't have that many maintainers, and their current workflow is basically to work with people making contributions on pull requests, help get them merged, and move on to the next one. We don't want them to have to sit around babysitting builds on, say, a release integration branch all the time.

We want a rolling release, because people do tend to use the develop branch, so we want it to be up to date with pretty current binaries all the time. But some people do pin themselves to releases, so we want snapshots for those releases as well. We need to be able to support, at least eventually, all the packages in Spack. And it still has to be source-buildable around the binaries: if you want to build one component from source and rely on binaries for the others, we want to support that. And finally, people trust sources. You can download the tarball and usually verify its checksum, except when GitHub changes the tarball hashes. We want to ensure that the binaries we're generating are just as trustworthy as the sources, and we've taken some steps to ensure that.
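As a sketch of what that mixed binary/source model looks like from the user's side: the mirror URL below is the public one Spack announced, but the exact flags vary across Spack versions, so treat this as illustrative:

```sh
# Add the public build cache and trust the project's signing keys
spack mirror add develop https://binaries.spack.io/develop
spack buildcache keys --install --trust

# Installs reuse signed binaries whenever the hashes match...
spack install hdf5

# ...and you can still ignore the cache and build from source instead
spack install --no-cache zlib
```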
So Spack is a little different from your standard distro, if you haven't gathered that already. If you think about a traditional package manager, you have roughly a recipe per configuration (your RPM spec or Debian package spec or whatever) that goes into a build farm and produces packages, at least for one platform, in a more or less one-to-one relationship with those specs. There's templating and such that goes on to reduce that. But you're typically maintaining one software stack that gets updated over time.

In Spack, we have these parameterized package recipes that go into the build farm, and it's really the same recipe being used across different architectures. We force contributors to work on the same package, so that you're essentially modeling all the different ways the software can be used, and we try to get a lot of reuse out of the recipes across platforms. Those go into the build farm, and you can use the same recipes to produce optimized binaries for lots of different platforms: a Graviton Arm build, a Skylake binary, a GPU build, and so on. And you can do that for many different software stacks for different use cases. And then we want you to be able to build from source on top of that. So that's what we're trying to do.

We put a CI architecture together based around this. Like I said, we want to be sustainable and keep the workflow we already have on the project, so we want GitHub to be the center of the distribution: what goes into develop is really maintaining the distribution as well as contributing to the project. We have a bunch of infrastructure currently stood up in AWS to support this. The binaries themselves and the sources are all distributed through S3 and CloudFront. We set up a big Kubernetes cluster to support autoscaling runners, and we're using high-availability GitLab in there to drive the CI.

GitLab may seem like a strange choice for maintaining a distribution, but the motivation is that all of the HPC centers also have internal GitLab instances, and so do a lot of universities and other sites. So the goal is for all of this automation and tooling to be usable not just in the cloud for the main Spack distribution, but also for people's personal software stacks locally. The idea is that we generate GitLab CI configuration, and you can use that either for this, or internally, or on an air-gapped network somewhere.

We're leveraging Karpenter on the backend for just-in-time instances for the runner pools. That's an open-source tool for AWS; you can find it on GitHub. It essentially lets you make requests for nodes with certain amounts of memory, certain target architectures, and so on, and it manages containers on the instances for you and moves work around so that you can have an efficient build pool in Kubernetes.
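For flavor, here is a minimal, generic Karpenter provisioner of the kind that backs a setup like this. This follows Karpenter's standard v1alpha5 schema, but it is not Spack's actual production configuration, and the names in it are made up:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spack-build-pool              # hypothetical pool name
spec:
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]      # x86_64 and Graviton build nodes
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand", "spot"]
  limits:
    resources:
      cpu: "1000"                     # cap the total size of the pool
  ttlSecondsAfterEmpty: 300           # tear idle nodes down after 5 minutes
```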
We also have some bare-metal runners at the University of Oregon with more exotic architectures than you can maybe find in the cloud. There's an AMD MI200 GPU builder in there; there's A64FX, the Arm architecture with vector instructions that Fugaku runs on; Power9; and so on. So we're able to do builds there for architectures that aren't supported in the cloud. There's some monitoring thrown in; we haven't really leveraged it in a smart way yet, but we are collecting a lot of data about our builds. And there's a bot that helps coordinate between GitHub and GitLab, with a sync script that allows us to build off of forks and things like that in GitLab. So it's fairly custom, but at least the GitLab component is recyclable internally. And we would like to be able to support more runners in the future: if, say, Azure wants to work with us on their HPC setup and provide runners for the project, or if other universities and places want to provide runners, we want to leave that open.

For maintaining the stacks themselves, we made it possible to instantiate a new stack in a pull request. We have a directory with the 16 or so stacks that we currently build in CI; you can see them there. Each one is a targeted software stack for some type of machine or some group, and each contains a YAML file with the configuration for the stack. The YAML file itself is fairly simple. It has a list of packages that you want to build (this is the machine learning stack for CUDA, and those are the names of the package specs being built) and then some configuration up at the top. For this particular stack, you're saying: I want to build for x86_64_v3, which is the AVX2 level, and I want to disable ROCm and enable CUDA, except on LLVM, because there's some weird bug with the CUDA support there, at least in our stack. So you can see it's fairly concise. You make a list of packages, you say "here's the configuration I want," and you can take this thing and go build a bunch of packages.
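The real files live under the cloud pipelines directory of the Spack repo; what follows is a trimmed, illustrative sketch of their shape, with abbreviated package lists, not the verbatim stack file:

```yaml
spack:
  packages:
    all:
      require: target=x86_64_v3   # AVX2 baseline for the whole stack
      variants: +cuda ~rocm       # stack-wide variant defaults
    llvm:
      require: ~cuda              # opt LLVM out because of the CUDA bug
  specs:
    - py-torch
    - py-tensorflow
    - py-scikit-learn
```

The same `require` mechanism can pin a compiler for every package in a stack (for example, `require: "%oneapi"`), which is roughly how the stack-wide compiler swap described next was configured.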
We make it easy to change low-level, stack-wide parameters. The packages in Spack are parameterized, so you can tell a whole stack to build with a different compiler. We had this large E4S stack, maybe 600 packages, working in standard environments, and we wanted to support the oneAPI compilers from Intel, their new optimizing compilers. It is unlikely that anyone has ever run this much open source through a proprietary vendor compiler like that, but it is Clang-based. So we were able to throw oneAPI into the config by just saying "here's where oneAPI lives" and making all packages require oneAPI. The build system swaps in the oneAPI compiler through some wrappers at the lower level, and we were able to get that stack working in a week or two, despite the fact that we'd never built a lot of these packages with oneAPI before. I think that's actually pretty cool. In a lot of cases it's not worth it to use a vendor compiler, because there are so many bugs and issues with software that's never been built against it. But here we're really throwing a big pile of open-source packages through it, and it helped us communicate with Intel. We were able to say, "hey, here are bugs we're seeing with your compiler," and link them directly to the build log for the build that failed. That helps them patch up the compiler, and it continues to help them ensure that it can build everything it needs to.

In Spack, you don't have a fixed configuration per package. Like I said, the recipes are these parameterized things, so there's actually a solving step for these stacks. You saw the requirements in the YAML file that said what I want to build; we run that through our package solver to get a fully resolved graph of all the things that need to be built in the stack, and that is used to generate a GitLab CI YAML.

One of the problems we have to solve there is mapping builds to runners. Once the whole thing is concrete, once we've said here are all the dependencies and these are the exact build configurations we want, we have to say how that should map to particular runners. We don't currently support things like cross-builds, so if you want to build for AVX-512, the fancier vector instructions on newer Intel CPUs, you need to make sure you get one of those CPUs in the build environment. So we say: if the spec matches AVX-512, give me an AVX-512 runner. If it matches one of these somewhat atrocious, hard-to-build packages like LLVM and PyTorch, give me a gigantic runner with lots of memory, and so on. Essentially what this is doing is saying: here are the package properties, here are the tags that should be on the runner; make sure I get a runner with those capabilities.
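In the generated pipeline configuration, that mapping is a list of match rules. Here is a hedged sketch of the shape, following Spack's older `gitlab-ci` config section; the tag names are made up for illustration:

```yaml
gitlab-ci:
  mappings:
    # hungry packages get big-memory runners
    - match: ["llvm", "py-torch"]
      runner-attributes:
        tags: ["spack", "huge", "x86_64_v3"]
    # AVX-512 builds must land on AVX-512-capable hardware
    - match: ["target=x86_64_v4"]
      runner-attributes:
        tags: ["spack", "x86_64_v4"]
    # everything else
    - match: ["os=ubuntu20.04"]
      runner-attributes:
        tags: ["spack", "x86_64_v3"]
```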
We haven't got a schema for all the tags yet, but I think we could standardize this and make it easy for someone to plug in runners at their own site for this sort of thing.

All right. One of the things that we did to ensure trust concerns the builds that happen in pull requests. If you trust Spack, you're basically trusting the maintainers, and we want to ensure that the binaries are things approved by the maintainers. So we can't just distribute binaries that got built in pull requests. When contributors submit package changes, we do the builds in private buckets, one per PR. The maintainers come along, see that it worked, review the code, and merge it; then everything is rebuilt on develop and signed. So essentially everything in the main release is built only from approved recipes; it doesn't use any binaries that were built in a PR.

The pull request integration definitely makes things easy for contributors, and we were able to take the system and announce our public binary cache last June with something like 4,600 builds in CI. It's mostly easy for contributors, who get a status update on their pull request, and mostly easy for users, who can just point at the binary mirror.

But there are some problems. One issue is that build caches are a lot different from RPMs and debs. In most distributions, you have a reasonably stable ABI for your packages: you rebuild one package and you can throw it in the mix with the others. Here, if you modify one package, you really do have to rebuild all of its dependents. If you modify xz, you have to rebuild everything that depends on it in the build cache. What that can mean is that in a gigantic software stack like this one, modifying, say, pkgconf at the bottom can trigger a massive rebuild of everything in the stack. That's one of the scalability problems I think we're going to have to deal with in the long term: you can get these really long-running pipelines. Packages like VisIt and PyTorch will build forever, and it frustrates contributors.
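The reason for those cascading rebuilds is that a package's installation hash covers the hashes of all of its dependencies, Merkle-tree style, so a change anywhere propagates upward. Here is a toy model of that idea in Python; this is conceptual only, not Spack's actual hashing code:

```python
import hashlib

def install_hash(name: str, recipe: str, dep_hashes: list) -> str:
    """Toy model: a package's hash covers its recipe AND its dependencies."""
    h = hashlib.sha256()
    h.update(name.encode())
    h.update(recipe.encode())
    for dep in sorted(dep_hashes):  # order-independent over dependencies
        h.update(dep.encode())
    return h.hexdigest()[:8]

xz      = install_hash("xz", "recipe-v1", [])
libxml2 = install_hash("libxml2", "recipe", [xz])

# Touch xz's recipe and every hash above it changes, so every dependent
# needs a fresh build in the cache.
xz_new      = install_hash("xz", "recipe-v2", [])
libxml2_new = install_hash("libxml2", "recipe", [xz_new])
assert libxml2_new != libxml2
```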
The other thing that happens comes from how the release works on develop: you're picking a commit every once in a while and building it. If you have a PR that's based behind the last develop build, that's okay, although GitHub typically wants to merge it with HEAD, which means you'd build a lot of redundant things in the PR environment. We can be picky and merge it with the last develop build instead, to ensure we get a lot of cache reuse. But what that means is that if we get a PR that's out ahead of the last develop build, say build D up there is still in progress, and you merge that second PR with D, you're basically doing the same builds D is doing, but in a PR environment. And if you have a bunch of those: we've brought GitLab down before by accidentally building all of the PRs that develop hasn't caught up with yet. So we have to be picky and hold those back until there's a develop build ahead of them, so that we get enough reuse out of the cache to support this.

The other problem with long pipelines is that, depending on how reliable your infrastructure is, the more things you build in a pipeline, the more likely you are to hit a build failure somewhere. And because we're building this whole cone of destruction in our pipelines, we're subject to system failures happening somewhere in the pipeline, and contributors have to babysit and restart builds that have nothing to do with what they're contributing. So we're looking for ways to make that better.

Another issue we have is consistency. Testing on PRs is not always sufficient to ensure that your develop branch is working. You may have some initial package state; a PR gets submitted and you test with a new B; another PR gets submitted and you test with a new C. If you don't require your PRs to be up to date with develop, then when they both get merged, the state in develop, with new versions of both packages together, is something you've never tested.

There are ways to get around this, and one of them is merge queues. We're looking at merge queues as a way to scale this pipeline out. They essentially allow you to do a small amount of testing on pull requests and then enqueue them (that's the gray stuff on the slide). The queued PRs are serialized for commit to develop; if they succeed, they're merged directly in a fast-forward fashion, and the full testing is done only on the merge queue. So you're always assured that the thing you tested is the thing that gets merged into develop.
So we're looking very much forward to GitHub making merge queues available in the next couple of weeks. The other thing we think that could enable is staging the work we do on PRs. So we're looking at ways to scale this out. Right now, for a relatively small number of packages, 4,600, we're able to do these massive rebuilds on PRs, but we need to stage the work to scale it out further, and that's what we're looking at now. We might build only the package, or only the package and its direct dependents, on PRs, and maybe phase how much work we do on the develop builds as well. But we do need to do a full build every once in a while so that there's a consistent state in the build cache. So that's where we're at. Thanks.

Q: Thank you very much for the presentation. You mentioned quite a few other technologies, like Nix, Guix, deb, RPM. You could have mentioned Homebrew as well, or maybe you did. And Docker. It feels like all these tools could help you, and yet it feels like you are building everything on your own. Is there a reason not to leverage any of these technologies?

A: Which technologies do you mean? We are leveraging a lot of technologies, right? I guess, which ones do you think we should?

Q: Nix, for example.

A: So, we don't. Nix has essentially one version of everything in the mainline, right? And in the HPC environment, what we want is for you to be able to build not just that one thing that's in the mainline, but a one-off, very easily. The whole point of Spack is, think of it as Nix with a solver. It's Nix where you can say: actually, no, build this version of this thing, with this build option, for that GPU, and it will take the recipe and reuse it for that purpose. Whereas in Nix, it's much harder to have package variants like that. So that's really the power of Spack: we're combinatorial Nix. You can think of it that way.

Q: Wouldn't you be able to leverage Nix and describe all these differences instead of redoing it?

A: No. The Nix packages don't do that.