[00:00.000 --> 00:11.360] Okay, final lightning talk for today is Ludovic talking about Guix.
[00:11.360 --> 00:14.000] All right, thank you.
[00:14.000 --> 00:15.680] Hello, HPC people.
[00:15.680 --> 00:17.200] So my name is Ludovic Courtès.
[00:17.200 --> 00:23.000] I work at Inria, which is a French research institute in computer science.
[00:23.000 --> 00:24.600] And I work as a research engineer.
[00:24.600 --> 00:29.040] So I'm very much concerned about engineering issues in general.
[00:29.040 --> 00:31.520] And in particular, I'm concerned about deployment.
[00:31.520 --> 00:37.520] So if you're an HPC dev room aficionado, we've probably met before.
[00:37.520 --> 00:43.160] I gave a couple of talks, I guess, in this room, more specifically about Guix.
[00:43.160 --> 00:44.920] So maybe you've already heard about Guix.
[00:44.920 --> 00:46.480] It's a software deployment tool.
[00:46.480 --> 00:52.040] So we have EasyBuild, Spack, also RPM, well, you know, apt, et cetera.
[00:52.040 --> 00:55.360] And this is yet another deployment tool, if you will.
[00:55.360 --> 00:59.600] But we have this very particular vision, you know, the grand vision where we're trying
[00:59.600 --> 01:04.440] to build a tool for reproducible research and HPC.
[01:04.440 --> 01:07.600] So what you see here is the vision, so to speak.
[01:07.600 --> 01:12.000] So at one end of the spectrum, we have, you know, research articles, and we want the research
[01:12.000 --> 01:13.480] to be solid.
[01:13.480 --> 01:17.000] So we want the computational workflows to be reproducible.
[01:17.000 --> 01:21.080] And at the other end of the spectrum, on the left, we have archives, source code archives
[01:21.080 --> 01:25.840] like Software Heritage, which we really need to have if we want that scientific source
[01:25.840 --> 01:29.320] code to, you know, to remain available over time.
[01:29.320 --> 01:34.000] And in the middle, well, we need a bunch of tools, in particular a deployment tool like
[01:34.000 --> 01:37.960] Guix, to deploy software reproducibly.
[01:37.960 --> 01:39.720] Yes.
[01:39.720 --> 01:44.920] So in a nutshell, yes, Guix provides actual tools for reproducible research.
[01:44.920 --> 01:49.400] I'm not going to go into details, but basically you can say, all right, I've made an experiment,
[01:49.400 --> 01:50.800] a computational experiment.
[01:50.800 --> 01:54.200] So now I'm going to pin the exact revision of Guix that I used.
[01:54.200 --> 01:56.360] This is the first command here.
[01:56.360 --> 02:00.960] And the second command is for, you know, some time later, when some colleague wants to reproduce
[02:00.960 --> 02:02.120] the results.
[02:02.120 --> 02:06.000] And so they use the time machine to jump to that specific revision of Guix.
[02:06.000 --> 02:10.560] And from there, they deploy the exact same packages that I had declared in that manifest
[02:10.560 --> 02:12.960] file, bit for bit.
[02:12.960 --> 02:14.480] That's the idea.
[02:14.480 --> 02:15.720] All right.
[02:15.720 --> 02:19.680] So in HPC, I guess most people in this room would agree,
[02:19.680 --> 02:22.520] we have two obsessions.
[02:22.520 --> 02:26.880] That's MPI and AVX, well, vector instructions.
[02:26.880 --> 02:28.640] We want things to run fast, right?
[02:28.640 --> 02:29.960] We have those fancy clusters.
[02:29.960 --> 02:33.200] So we want to make sure that the communications are going to be fast.
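
The two commands referred to above amount to something like the following sketch; channels.scm and manifest.scm are conventional file names, not anything mandated by Guix:

# Pin the exact Guix revision (and channels) used for the experiment.
guix describe --format=channels > channels.scm

# Later, or on a colleague's machine: jump back to that exact revision
# and deploy the packages listed in the manifest, bit for bit.
guix time-machine -C channels.scm -- shell -m manifest.scm
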
[02:33.200 --> 02:37.600] We want to make sure we're going to use the latest vector instructions of our CPUs.
[02:37.600 --> 02:39.720] And that makes a lot of sense.
[02:39.720 --> 02:45.800] But sometimes, maybe, we have preconceptions about the implications of all this.
[02:45.800 --> 02:49.920] So here I'm quoting Todd Gamblin, who's maybe in this room, actually.
[02:49.920 --> 02:52.400] Hi, Todd, if you can see me.
[02:52.400 --> 02:57.120] This is an example where, well, Todd here was saying, you know, binary distributions
[02:57.120 --> 03:02.800] like Debian or Guix or Fedora, for example, are just targeting the baseline CPU,
[03:02.800 --> 03:06.920] like x86_64 without AVX, for example.
[03:06.920 --> 03:09.360] And that's a problem for performance.
[03:09.360 --> 03:13.040] Because, of course, if you have the latest fancy Intel processor, then you probably want
[03:13.040 --> 03:15.760] to use those vector instructions.
[03:15.760 --> 03:22.440] But the conclusion that, because of this, we cannot use, you know, binary distributions,
[03:22.440 --> 03:26.960] distributions like Guix or Debian that provide binaries, is not entirely accurate.
[03:26.960 --> 03:30.840] That's the point I'm trying to make in this talk.
[03:30.840 --> 03:35.760] So yeah, as most of you know, there's a whole bunch of vector extensions.
[03:35.760 --> 03:40.720] It keeps growing, you know, like every few years we have new vector extensions in
[03:40.720 --> 03:46.400] Intel or AMD CPUs, or even AArch64 CPUs, POWER9, et cetera.
[03:46.400 --> 03:52.800] And it's even worse if you look at the actual CPU models; for example, this is just for
[03:52.800 --> 03:55.360] Intel, there's a whole bunch of things.
[03:55.360 --> 04:01.440] It's not always a superset of the previous CPU, you know; we were discussing it the other
[04:01.440 --> 04:02.440] day at dinner.
[04:02.440 --> 04:04.400] And yeah, sometimes it's complicated.
[04:04.400 --> 04:09.280] You cannot say that Skylake-AVX512 is exactly a superset of Skylake.
[04:09.280 --> 04:10.800] It's complicated.
[04:10.800 --> 04:15.120] And yet you want to be able to target these CPUs specifically, these micro-architectures.
[04:15.120 --> 04:17.680] And it makes a big difference.
[04:17.680 --> 04:20.360] So this is an example from an Eigen benchmark.
[04:20.360 --> 04:27.520] So Eigen is a C++ library for linear algebra, specifically targeting small matrices.
[04:27.520 --> 04:32.840] And well, you know, on my laptop, if I'm compiling with -march=skylake,
[04:32.840 --> 04:39.400] then I get a throughput that's three times the baseline performance.
[04:39.400 --> 04:40.840] So it's a pretty big deal.
[04:40.840 --> 04:42.440] So we definitely want to use that.
[04:42.440 --> 04:49.080] We want to be able to compile specifically for the CPU micro-architecture that we have.
[04:49.080 --> 04:55.240] But the good news is that, to a large extent, that's been a solved problem for a long time.
[04:55.240 --> 05:02.120] So there is this thing called function multi-versioning that is already used in a number of performance-critical
[05:02.120 --> 05:03.120] libraries.
[05:03.120 --> 05:07.520] So if you look at the libc for string comparison, or if you look at OpenBLAS, if you look
[05:07.520 --> 05:15.080] at FFTW, GMP for multi-precision arithmetic, you know, many libraries, programming languages,
[05:15.080 --> 05:18.560] runtimes, already use function multi-versioning.
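
As a rough illustration of what function multi-versioning looks like at the source level (my own sketch, not something shown in the talk), GCC and Clang expose it through the target_clones attribute; how the runtime dispatch works is described next:

# Write a small C file whose hot function is cloned for several x86_64 ISA levels.
cat > dot.c <<'EOF'
__attribute__ ((target_clones ("default", "avx2", "avx512f")))
double dot (const double *x, const double *y, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += x[i] * y[i];
  return sum;
}
EOF

# Compile it and list the symbols: one clone per ISA plus a resolver that
# picks the right clone at load time (exact symbol names vary by compiler version).
gcc -O2 -c dot.c
nm dot.o | grep dot
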
[05:18.560 --> 05:19.960] So what's the deal here?
[05:19.960 --> 05:24.880] Well, roughly, when you have function multi-versioning, you can say, well, I have one function that
[05:24.880 --> 05:30.520] does some linear algebra stuff, for example, and I'm actually providing several variants
[05:30.520 --> 05:32.080] of that function.
[05:32.080 --> 05:37.080] And when I start my program, at runtime, the loader or, you know, the runtime system is
[05:37.080 --> 05:41.400] going to pick the most optimized one for the CPU I have at hand, right?
[05:41.400 --> 05:47.520] So if I use GMP, for example, for multi-precision arithmetic, it's going to pick the fastest
[05:47.520 --> 05:49.880] implementation it has, you know.
[05:49.880 --> 05:55.040] So you can compile GMP once, and then it's going to use the right one at runtime.
[05:55.040 --> 06:00.840] And if you're using GCC or Clang, you can specify in your C code, well, I want this
[06:00.840 --> 06:07.040] particular function to be cloned, to have several variants, one for each CPU micro-architecture,
[06:07.040 --> 06:11.680] and GCC or Clang is going to create several variants of that function so that it can pick
[06:11.680 --> 06:15.280] the right one at runtime.
[06:15.280 --> 06:21.280] So kind of a solved problem, in a way, well, except in some cases.
[06:21.280 --> 06:28.760] Well, one particular case where we have a problem is C++ template libraries, like Eigen,
[06:28.760 --> 06:34.520] which I was mentioning before; they are not able to benefit from function multi-versioning
[06:34.520 --> 06:35.520] in any way.
[06:35.520 --> 06:41.040] So when you compile your Eigen benchmark, well, you really have to use -march=skylake,
[06:41.040 --> 06:44.680] for example, if you are targeting a Skylake CPU.
[06:44.680 --> 06:49.400] And this is because, if you look at the Eigen headers, for example, there are many places where
[06:49.400 --> 06:54.760] you have ifdefs: do I have AVX-512 at compilation time?
[06:54.760 --> 06:58.440] If yes, then I'm going to use the optimized implementation; otherwise, I'm going to use
[06:58.440 --> 07:00.640] the baseline implementation.
[07:00.640 --> 07:05.400] And this is all happening at compilation time, so you really have to have a solution at compilation
[07:05.400 --> 07:08.360] time to address this.
[07:08.360 --> 07:11.800] And so this is where Guix comes in.
[07:11.800 --> 07:16.360] So Guix is, you know, a distribution, like Debian, like I was saying, that's targeting
[07:16.360 --> 07:22.360] the baseline instruction set, but we came up with a new thing that's called package multi-versioning.
[07:22.360 --> 07:27.840] It's actually one year old or something. The idea is roughly to take the same idea
[07:27.840 --> 07:34.320] as function multi-versioning, but to apply it at the level of entire packages.
[07:34.320 --> 07:41.080] So let's say I have those Eigen benchmarks; I can run them using just the baseline x86_64
[07:41.080 --> 07:44.160] architecture, using this guix shell command.
[07:44.160 --> 07:49.760] It's, you know, taking the eigen-benchmarks package, and in that package running the bench_gemm
[07:49.760 --> 07:55.120] command, right, on a small matrix.
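
The pair of invocations being described, the baseline run just mentioned and the tuned run the talk turns to next, look roughly like this; eigen-benchmarks is the package named in the talk, while bench_gemm is my best reconstruction of the benchmark program's name:

# Run the GEMM benchmark as built for the baseline x86_64 instruction set.
guix shell eigen-benchmarks -- bench_gemm

# Same run, but asking Guix to tune the package for the CPU we are running on.
guix shell --tune eigen-benchmarks -- bench_gemm
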
[07:55.120 --> 08:01.160] And then I can say, all right, now I want to tune that code specifically for my CPU,
[08:01.160 --> 08:08.240] and then I just add that extra --tune option, and it's telling Guix,
[08:08.240 --> 08:15.200] all right, please optimize that eigen-benchmarks package directly for the CPU I'm on, which
[08:15.200 --> 08:19.240] is Skylake in this case, and that's it.
[08:19.240 --> 08:23.560] And what happens behind the scenes is that, on the fly, Guix is creating a new package
[08:23.560 --> 08:24.560] variant.
[08:24.560 --> 08:29.800] So it's taking that eigen-benchmarks package, creating a new package variant that is built
[08:29.800 --> 08:36.960] specifically with a compiler wrapper that passes the -march=skylake flag.
[08:36.960 --> 08:43.600] And I get the performance, and I'm happy, right? So this has been in Guix since 2022, and
[08:43.600 --> 08:48.360] it's still reproducible, you know, because we can still say, all right, what precise option
[08:48.360 --> 08:53.120] did I use, what --tune option did I use, and it's Skylake, all right, so the
[08:53.120 --> 08:57.960] build process of the package remains reproducible, right? I'm still getting the same binary if
[08:57.960 --> 09:05.800] I use --tune=skylake on my laptop or on some HPC cluster, whatever.
[09:05.800 --> 09:10.000] And there are no world rebuilds, which means that the build farm, for example, the official
[09:10.000 --> 09:16.280] build farm of the project, is providing several variants of those packages, those, you know,
[09:16.280 --> 09:21.480] performance-sensitive packages, built for Skylake, Skylake-AVX512, you know, different things.
[09:21.480 --> 09:26.560] So if you install them, most likely you're going to get a pre-built binary that's specifically
[09:26.560 --> 09:28.640] optimized for that CPU.
[09:28.640 --> 09:34.600] And if not, well, that's fine, it's going to build it for you, that's okay.
[09:34.600 --> 09:42.200] So my conclusion here is, you know, we keep talking about performance, MPI, vector instructions,
[09:42.200 --> 09:43.200] and so forth.
[09:43.200 --> 09:47.680] Well, I think we can have performance, we can have portable performance, that's what
[09:47.680 --> 09:53.800] we should aim for, and we can still have reproducibility; we don't have to sacrifice reproducibility
[09:53.800 --> 09:57.120] for performance. That's my take-home message.
[09:57.120 --> 10:05.000] Thank you.
[10:05.000 --> 10:06.000] Thank you very much.
[10:06.000 --> 10:08.840] Again, time for one question.
[10:08.840 --> 10:14.640] Okay, yeah, this whole --tune thing looks awesome.
[10:14.640 --> 10:18.840] But what if the majority of the computation time is spent in libraries that that package
[10:18.840 --> 10:20.000] is actually using?
[10:20.000 --> 10:22.960] How do I tell it to optimize those as well?
[10:22.960 --> 10:23.960] Right.
[10:23.960 --> 10:29.520] So the way it works in Guix, you can annotate packages that really need to be tunable, right?
[10:29.520 --> 10:33.680] So you can add a property to a package, so it would be eigen-benchmarks in this
[10:33.680 --> 10:38.640] case, or it could be the GNU Scientific Library, GSL, and you say this package needs
[10:38.640 --> 10:43.920] to be tunable, so if I use --tune, please tune specifically this package.
[10:43.920 --> 10:48.040] And it's going to work even if you're installing, you know, an application that actually depends
[10:48.040 --> 10:50.840] on GSL, for example.
[10:50.840 --> 10:54.440] All right, thanks a lot, Ludovic.
[10:54.440 --> 10:55.440] Thank you.
[10:55.440 --> 11:21.440] Thank you.
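
On the reproducibility point made in the talk, --tune also accepts an explicit micro-architecture name, so the exact variant that was benchmarked can be stated and rebuilt elsewhere; a sketch, using the same package name as above:

# Request the Skylake-tuned variant by name; the same command on another
# machine rebuilds (or downloads) the same binary.
guix shell --tune=skylake eigen-benchmarks -- bench_gemm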