All right, so I'm Todd Gamblin, and I'm from Lawrence Livermore National Laboratory. Normally I would give an intro to what Livermore is, but who's been hearing about Livermore in the news lately? If you heard about the fusion ignition over in the US, that's our lab. So I'm from there.

I work in the HPC area at Livermore, where we have a big supercomputing center. The HPC ecosystem is a pretty complex place. People distribute software mostly as source. You build lots of different variants of each package. Users typically don't have root on the machine when they install software, so they're building from source and installing into their home directory. And you want the code to be optimized for fancy machines like these ones over here. So you're trying to build software that supports a really broad set of environments, including Power, Arm, AMD, and Intel, and also GPU architectures. NVIDIA and now AMD GPUs are showing up, and we've even got a machine coming up at Argonne, near Chicago, with Intel Ponte Vecchio GPUs. On top of all that, the ecosystem has C, C++, Fortran, Python, Lua, and other languages, all linked together in the same app. So we want a distribution that can support this type of environment.

Spack is a package manager that enables software distribution for HPC, given that set of constraints. Packages are not quite like the build specs you would see in a standard RPM- or deb-based distribution. They're really parameterized Python recipes for how to build a package on lots of different architectures, and there's a DSL for doing that, which I'm not going to get into today. But the end user can essentially take one package and install it lots of different ways. You could say: I want to install HDF5 at a particular version; I want to install it with Clang, not GCC; I want the thread-safe option on; or I want to inject some flags into the build and get an entirely different version built with a different set of flags; or one that's targeted at a particular micro-architecture; or one that uses a particular dependency. So you can build the same package with two versions of MPI.
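To make that concrete, here is a rough sketch of the kind of one-off builds Spack's spec syntax expresses on the command line; the version numbers and variant names below are illustrative examples, not something from the talk:

```sh
# Illustrative uses of Spack's spec syntax (versions/variants are examples):
spack install hdf5@1.14.0              # pin a particular version
spack install hdf5 %clang              # build with clang instead of gcc
spack install hdf5 +threadsafe         # flip a build option (a "variant")
spack install hdf5 cflags="-O3"        # inject compiler flags into the build
spack install hdf5 target=skylake      # target a specific micro-architecture
spack install hdf5 ^openmpi            # build against one MPI...
spack install hdf5 ^mvapich2           # ...and the same package against another
```

Each of these produces a separately hashed installation, so all of the variants can coexist side by side.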
So we're trying to provide the ease of use of mainstream tools with the flexibility needed for HPC, so that we can get performance everywhere. Spack builds from source, but you can also install from relocatable build caches, much like you would with, say, Nix or Guix. Their caches aren't relocatable, because they're not really targeting the home-directory use case, but it's the same build-cache model. It's not a typical binary distribution.

The whole project has a fairly large community of contributors, or at least maybe not large by some other distributions' standards, but we have 1,100-plus contributors. We maintain the core tool, and then there's a whole bunch of people who work on package recipes. So in some ways it looks a lot like Homebrew or a project like that. And then there's a whole bunch of infrastructure behind the scenes to keep all this working.

All of these things together enable people to build lots of different software stacks. There's E4S, the extreme-scale software stack maintained by the US Exascale Computing Project. AWS has a stack that they use on their ParallelCluster product, internally and for users. Livermore has its internal software deployment. There are math library stacks, viz tool stacks, things like that. And every application in HPC is really its own software stack. You heard about Flatpaks and Snaps in the last session; making apps more mindful of the fact that their software is actually a distribution is something we've been pushing for a long time within HPC.

The GitHub repo is a pretty busy place. We merge 300 to 500 PRs per month, and something like 411 commits or more. Managing that is kind of painful, and we're trying very hard to reduce downstream work, which is actually difficult for a source-based distribution.

If you think about how Spack is structured, there's a mainline develop branch that most people actually use. They'll just clone it straight from the repo and build from that, kind of like you do with Nix packages. External contributors contribute there. And we cut a release every once in a while, where we stabilize the packages and keep them fixed so that you don't have a lot of version churn in the repo. Then, to actually integrate with the HPC facilities, all the places that are deploying supercomputers, we have this E4S software distribution, where they end up doing a whole bunch of downstream integration at the site: they're basically building the whole thing from source in a new environment. And there's a whole lot of debugging that takes place there that we would really like to be able to move upstream.
The applications, likewise, are not necessarily using what the facility deploys. Some of them do; some of them don't. They pull from basically all of these places. They might get a math solver library from the facility; they might get something else installed from Spack mainline, built the way they want; and they may pull stuff off of release branches too, all to assemble one application and get it built. So there's a lot of porting happening at the bottom of the stack, and what we'd really like to do is move that software integration upstream: get to a point where we build these types of environments in CI all the time, in a sort of rolling release, and do binary deploys on the supercomputers with actually optimized binaries. That's what we're trying to get to.

So we set out to make a binary distribution with a bunch of different goals. The main one, and the one that's pretty key to our whole ecosystem, is that it has to be sustainable. We don't have that many maintainers, and their current workflow is basically to work with people making contributions on pull requests, help get them merged, and move on to the next one. We don't want them to have to sit around babysitting builds on, say, a release integration branch all the time.

We want a rolling release, because people do tend to use the develop branch, so we want it to be up to date with pretty current binaries all the time. But some people do pin themselves to releases, so we want snapshots for those releases as well. We need to be able to support, at least eventually, all the packages in Spack. And it still has to be source-buildable around the binaries: if you want to build one component from source and rely on binaries for the others, we want to support that. And finally, people trust sources. You can download the tarball and usually verify its checksum, except when GitHub changes the tarball hashes. We want to ensure that the binaries we're generating are just as trustworthy as the sources, and we've taken some steps to ensure that.
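As a sketch of what that mixed binary/source model looks like from the user's side: the mirror URL below is the public one Spack announced, but the exact flags vary across Spack versions, so treat this as illustrative:

```sh
# Add the public build cache and trust the project's signing keys
spack mirror add develop https://binaries.spack.io/develop
spack buildcache keys --install --trust

# Installs reuse signed binaries whenever the hashes match...
spack install hdf5

# ...and you can still ignore the cache and build from source instead
spack install --no-cache zlib
```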
So Spack is a little different from your standard distro, if you haven't gathered that already. If you think about a traditional package manager, you have roughly a recipe per configuration (your RPM spec or Debian package spec or whatever) that goes into a build farm and produces packages, at least for one platform, in a more or less one-to-one relationship with those specs. There's templating and such that goes on to reduce that. But you're typically maintaining one software stack that gets updated over time.

In Spack, we have these parameterized package recipes that go into the build farm, and it's really the same recipe being used across different architectures. We force contributors to work on the same package, so that you're essentially modeling all the different ways the software can be used, and we try to get a lot of reuse out of the recipes across platforms. Those go into the build farm, and you can use the same recipes to produce optimized binaries for lots of different platforms: a Graviton Arm build, a Skylake binary, a GPU build, and so on. And you can do that for many different software stacks for different use cases. And then we want you to be able to build from source on top of that. So that's what we're trying to do.

We put a CI architecture together based around this. Like I said, we want to be sustainable and keep the workflow we already have on the project, so we want GitHub to be the center of the distribution: what goes into develop is really maintaining the distribution as well as contributing to the project. We have a bunch of infrastructure currently stood up in AWS to support this. The binaries themselves and the sources are all distributed through S3 and CloudFront. We set up a big Kubernetes cluster to support autoscaling runners, and we're using high-availability GitLab in there to drive the CI.

GitLab may seem like a strange choice for maintaining a distribution, but the motivation is that all of the HPC centers also have internal GitLab instances, and so do a lot of universities and other sites. So the goal is for all of this automation and tooling to be usable not just in the cloud for the main Spack distribution, but also for people's personal software stacks locally. The idea is that we generate GitLab CI configuration, and you can use that either for this, or internally, or on an air-gapped network somewhere.

We're leveraging Karpenter on the backend for just-in-time instances for the runner pools. That's an open-source tool for AWS; you can find it on GitHub. It essentially lets you make requests for nodes with certain amounts of memory, certain target architectures, and so on, and it manages containers on the instances for you and moves work around so that you can have an efficient build pool in Kubernetes.
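For flavor, here is a minimal, generic Karpenter provisioner of the kind that backs a setup like this. This follows Karpenter's standard v1alpha5 schema, but it is not Spack's actual production configuration, and the names in it are made up:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spack-build-pool              # hypothetical pool name
spec:
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]      # x86_64 and Graviton build nodes
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand", "spot"]
  limits:
    resources:
      cpu: "1000"                     # cap the total size of the pool
  ttlSecondsAfterEmpty: 300           # tear idle nodes down after 5 minutes
```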
We also have some bare-metal runners at the University of Oregon with more exotic architectures than you can maybe find in the cloud. There's an AMD MI200 GPU builder in there; there's A64FX, the Arm architecture with vector instructions that Fugaku runs on; Power9; and so on. So we're able to do builds there for architectures that aren't supported in the cloud. There's some monitoring thrown in; we haven't really leveraged it in a smart way yet, but we are collecting a lot of data about our builds. And there's a bot that helps coordinate between GitHub and GitLab, with a sync script that allows us to build off of forks and things like that in GitLab. So it's fairly custom, but at least the GitLab component is recyclable internally. And we would like to be able to support more runners in the future: if, say, Azure wants to work with us on their HPC setup and provide runners for the project, or if other universities and places want to provide runners, we want to leave that open.

For maintaining the stacks themselves, we made it possible to instantiate a new stack in a pull request. We have a directory with the 16 or so stacks that we currently build in CI; you can see them there. Each one is a targeted software stack for some type of machine or some group, and each contains a YAML file with the configuration for the stack. The YAML file itself is fairly simple. It has a list of packages that you want to build (this is the machine learning stack for CUDA, and those are the names of the package specs being built) and then some configuration up at the top. For this particular stack, you're saying: I want to build for x86_64_v3, which is the AVX2 level, and I want to disable ROCm and enable CUDA, except on LLVM, because there's some weird bug with the CUDA support there, at least in our stack. So you can see it's fairly concise. You make a list of packages, you say "here's the configuration I want," and you can take this thing and go build a bunch of packages.
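The real files live under the cloud pipelines directory of the Spack repo; what follows is a trimmed, illustrative sketch of their shape, with abbreviated package lists, not the verbatim stack file:

```yaml
spack:
  packages:
    all:
      require: target=x86_64_v3   # AVX2 baseline for the whole stack
      variants: +cuda ~rocm       # stack-wide variant defaults
    llvm:
      require: ~cuda              # opt LLVM out because of the CUDA bug
  specs:
    - py-torch
    - py-tensorflow
    - py-scikit-learn
```

The same `require` mechanism can pin a compiler for every package in a stack (for example, `require: "%oneapi"`), which is roughly how the stack-wide compiler swap described next was configured.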
We make it easy to change low-level, stack-wide parameters. The packages in Spack are parameterized, so you can tell a whole stack to build with a different compiler. We had this large E4S stack, maybe 600 packages, working in standard environments, and we wanted to support the oneAPI compilers from Intel, their new optimizing compilers. It is unlikely that anyone has ever run this much open source through a proprietary vendor compiler like that, but it is Clang-based. So we were able to throw oneAPI into the config by just saying "here's where oneAPI lives" and making all packages require oneAPI. The build system swaps in the oneAPI compiler through some wrappers at the lower level, and we were able to get that stack working in a week or two, despite the fact that we'd never built a lot of these packages with oneAPI before. I think that's actually pretty cool. In a lot of cases it's not worth it to use a vendor compiler, because there are so many bugs and issues with software that's never been built against it. But here we're really throwing a big pile of open-source packages through it, and it helped us communicate with Intel. We were able to say, "hey, here are bugs we're seeing with your compiler," and link them directly to the build log for the build that failed. That helps them patch up the compiler, and it continues to help them ensure that it can build everything it needs to.

In Spack, you don't have a fixed configuration per package. Like I said, the recipes are these parameterized things, so there's actually a solving step for these stacks. You saw the requirements in the YAML file that said what I want to build; we run that through our package solver to get a fully resolved graph of all the things that need to be built in the stack, and that is used to generate a GitLab CI YAML.

One of the problems we have to solve there is mapping builds to runners. Once the whole thing is concrete, once we've said here are all the dependencies and these are the exact build configurations we want, we have to say how that should map to particular runners. We don't currently support things like cross-builds, so if you want to build for AVX-512, the fancier vector instructions on newer Intel CPUs, you need to make sure you get one of those CPUs in the build environment. So we say: if the spec matches AVX-512, give me an AVX-512 runner. If it matches one of these somewhat atrocious, hard-to-build packages like LLVM and PyTorch, give me a gigantic runner with lots of memory, and so on. Essentially what this is doing is saying: here are the package properties, here are the tags that should be on the runner; make sure I get a runner with those capabilities.
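In the generated pipeline configuration, that mapping is a list of match rules. Here is a hedged sketch of the shape, following Spack's older `gitlab-ci` config section; the tag names are made up for illustration:

```yaml
gitlab-ci:
  mappings:
    # hungry packages get big-memory runners
    - match: ["llvm", "py-torch"]
      runner-attributes:
        tags: ["spack", "huge", "x86_64_v3"]
    # AVX-512 builds must land on AVX-512-capable hardware
    - match: ["target=x86_64_v4"]
      runner-attributes:
        tags: ["spack", "x86_64_v4"]
    # everything else
    - match: ["os=ubuntu20.04"]
      runner-attributes:
        tags: ["spack", "x86_64_v3"]
```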
We haven't got a schema for all the tags yet, but I think we could standardize this and make it easy for someone to plug in runners at their own site for this sort of thing.

All right. One of the things that we did to ensure trust concerns the builds that happen in pull requests. If you trust Spack, you're basically trusting the maintainers, and we want to ensure that the binaries are things approved by the maintainers. So we can't just distribute binaries that got built in pull requests. When contributors submit package changes, we do the builds in private buckets, one per PR. The maintainers come along, see that it worked, review the code, and merge it; then everything is rebuilt on develop and signed. So essentially everything in the main release is built only from approved recipes; it doesn't use any binaries that were built in a PR.

The pull request integration definitely makes things easy for contributors, and we were able to take the system and announce our public binary cache last June with something like 4,600 builds in CI. It's mostly easy for contributors, who get a status update on their pull request, and mostly easy for users, who can just point at the binary mirror.

But there are some problems. One issue is that build caches are a lot different from RPMs and debs. In most distributions, you have a reasonably stable ABI for your packages: you rebuild one package and you can throw it in the mix with the others. Here, if you modify one package, you really do have to rebuild all of its dependents. If you modify xz, you have to rebuild everything that depends on it in the build cache. What that can mean is that in a gigantic software stack like this one, modifying, say, pkgconf at the bottom can trigger a massive rebuild of everything in the stack. That's one of the scalability problems I think we're going to have to deal with in the long term: you can get these really long-running pipelines. Packages like VisIt and PyTorch will build forever, and it frustrates contributors.
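The reason for those cascading rebuilds is that a package's installation hash covers the hashes of all of its dependencies, Merkle-tree style, so a change anywhere propagates upward. Here is a toy model of that idea in Python; this is conceptual only, not Spack's actual hashing code:

```python
import hashlib

def install_hash(name: str, recipe: str, dep_hashes: list) -> str:
    """Toy model: a package's hash covers its recipe AND its dependencies."""
    h = hashlib.sha256()
    h.update(name.encode())
    h.update(recipe.encode())
    for dep in sorted(dep_hashes):  # order-independent over dependencies
        h.update(dep.encode())
    return h.hexdigest()[:8]

xz      = install_hash("xz", "recipe-v1", [])
libxml2 = install_hash("libxml2", "recipe", [xz])

# Touch xz's recipe and every hash above it changes, so every dependent
# needs a fresh build in the cache.
xz_new      = install_hash("xz", "recipe-v2", [])
libxml2_new = install_hash("libxml2", "recipe", [xz_new])
assert libxml2_new != libxml2
```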
The other thing that happens comes from how the release works on develop: you're picking a commit every once in a while and building it. If you have a PR that's based behind the last develop build, that's okay, although GitHub typically wants to merge it with HEAD, which means you'd build a lot of redundant things in the PR environment. We can be picky and merge it with the last develop build instead, to ensure we get a lot of cache reuse. But what that means is that if we get a PR that's out ahead of the last develop build, say build D up there is still in progress, and you merge that second PR with D, you're basically doing the same builds D is doing, but in a PR environment. And if you have a bunch of those: we've brought GitLab down before by accidentally building all of the PRs that develop hasn't caught up with yet. So we have to be picky and hold those back until there's a develop build ahead of them, so that we get enough reuse out of the cache to support this.

The other problem with long pipelines is that, depending on how reliable your infrastructure is, the more things you build in a pipeline, the more likely you are to hit a build failure somewhere. And because we're building this whole cone of destruction in our pipelines, we're subject to system failures happening somewhere in the pipeline, and contributors have to babysit and restart builds that have nothing to do with what they're contributing. So we're looking for ways to make that better.

Another issue we have is consistency. Testing on PRs is not always sufficient to ensure that your develop branch is working. You may have some initial package state; a PR gets submitted and you test with a new B; another PR gets submitted and you test with a new C. If you don't require your PRs to be up to date with develop, then when they both get merged, the state in develop, with new versions of both packages together, is something you've never tested.

There are ways to get around this, and one of them is merge queues. We're looking at merge queues as a way to scale this pipeline out. They essentially allow you to do a small amount of testing on pull requests and then enqueue them (that's the gray stuff on the slide). The queued PRs are serialized for commit to develop; if they succeed, they're merged directly in a fast-forward fashion, and the full testing is done only on the merge queue. So you're always assured that the thing you tested is the thing that gets merged into develop.
So we're looking very much forward to GitHub making merge queues available in the next couple of weeks. The other thing we think that could enable is staging the work we do on PRs. So we're looking at ways to scale this out. Right now, for a relatively small number of packages, 4,600, we're able to do these massive rebuilds on PRs, but we need to stage the work to scale it out further, and that's what we're looking at now. We might build only the package, or only the package and its direct dependents, on PRs, and maybe phase how much work we do on the develop builds as well. But we do need to do a full build every once in a while so that there's a consistent state in the build cache. So that's where we're at. Thanks.

Q: Thank you very much for the presentation. You mentioned quite a few other technologies, like Nix, Guix, deb, RPM. You could have mentioned Homebrew as well, or maybe you did. And Docker. It feels like all these tools could help you, and yet it feels like you are building everything on your own. Is there a reason not to leverage any of these technologies?

A: Which technologies do you mean? We are leveraging a lot of technologies, right? I guess, which ones do you think we should?

Q: Nix, for example.

A: So, we don't. Nix has essentially one version of everything in the mainline, right? And in the HPC environment, what we want is for you to be able to build not just that one thing that's in the mainline, but a one-off, very easily. The whole point of Spack is, think of it as Nix with a solver. It's Nix where you can say: actually, no, build this version of this thing, with this build option, for that GPU, and it will take the recipe and reuse it for that purpose. Whereas in Nix, it's much harder to have package variants like that. So that's really the power of Spack: we're combinatorial Nix. You can think of it that way.

Q: Wouldn't you be able to leverage Nix and describe all these differences instead of redoing it?

A: No. The Nix packages don't do that.