[00:00.000 --> 00:11.360] Okay, final lightning talk for today is Ludovic talking about Guix.
[00:11.360 --> 00:14.000] All right, thank you.
[00:14.000 --> 00:15.680] Hello, HPC people.
[00:15.680 --> 00:17.200] So my name is Ludovic Courtès.
[00:17.200 --> 00:23.000] I work at Inria, which is a French research institute in computer science.
[00:23.000 --> 00:24.600] And I work as a research engineer.
[00:24.600 --> 00:29.040] So I'm very much concerned about engineering issues in general.
[00:29.040 --> 00:31.520] And in particular, I'm concerned about deployment.
[00:31.520 --> 00:37.520] So if you're an HPC dev room aficionado, we've probably met before.
[00:37.520 --> 00:43.160] I gave a couple of talks, I guess, in this room, more specifically about Guix.
[00:43.160 --> 00:44.920] So maybe you've already heard about Guix.
[00:44.920 --> 00:46.480] It's a software deployment tool.
[00:46.480 --> 00:52.040] So we have EasyBuild, Spack, also RPM, well, you know, apt, et cetera.
[00:52.040 --> 00:55.360] And this is yet another deployment tool, if you will.
[00:55.360 --> 00:59.600] But we have this very particular vision, you know, the grand vision where we're trying
[00:59.600 --> 01:04.440] to build a tool for reproducible research and HPC.
[01:04.440 --> 01:07.600] So what you see here is the vision, so to speak.
[01:07.600 --> 01:12.000] So at one end of the spectrum, we have, you know, research articles, and we want the research
[01:12.000 --> 01:13.480] to be solid.
[01:13.480 --> 01:17.000] So we want the computational workflows to be reproducible.
[01:17.000 --> 01:21.080] And at the other end of the spectrum, on the left, we have archives, source code archives
[01:21.080 --> 01:25.840] like Software Heritage, which we really need to have if we want that scientific source
[01:25.840 --> 01:29.320] code to, you know, to remain available over time.
[01:29.320 --> 01:34.000] And in the middle, well, we need a bunch of tools, in particular a deployment tool like
[01:34.000 --> 01:37.960] Guix, to deploy software reproducibly.
[01:37.960 --> 01:39.720] Yes.
[01:39.720 --> 01:44.920] So in a nutshell, yes, Guix provides actual tools for reproducible research.
[01:44.920 --> 01:49.400] I'm not going to go into details, but basically you can say, all right, I've made an experiment,
[01:49.400 --> 01:50.800] a computational experiment.
[01:50.800 --> 01:54.200] So now I'm going to pin the exact revision of Guix that I used.
[01:54.200 --> 01:56.360] This is the first command here.
[01:56.360 --> 02:00.960] And the second command is for, you know, some time later, when some colleague wants to reproduce
[02:00.960 --> 02:02.120] the results.
[02:02.120 --> 02:06.000] And so they use the time machine to jump to that specific revision of Guix.
[02:06.000 --> 02:10.560] And from there, they deploy the exact same packages that I had declared in that manifest
[02:10.560 --> 02:12.960] file, bit for bit.
[02:12.960 --> 02:14.480] That's the idea.
[02:14.480 --> 02:15.720] All right.
[02:15.720 --> 02:19.680] So in HPC, I guess most people in this room would agree,
[02:19.680 --> 02:22.520] we have two obsessions.
[02:22.520 --> 02:26.880] That's MPI and AVX, well, vector instructions.
[02:26.880 --> 02:28.640] We want things to run fast, right?
[02:28.640 --> 02:29.960] We have those fancy clusters.
[02:29.960 --> 02:33.200] So we want to make sure that the communications are going to be fast.
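
The two commands referred to above amount to something like the following sketch; channels.scm and manifest.scm are conventional file names, not anything mandated by Guix:

# Pin the exact Guix revision (and channels) used for the experiment.
guix describe --format=channels > channels.scm

# Later, or on a colleague's machine: jump back to that exact revision
# and deploy the packages listed in the manifest, bit for bit.
guix time-machine -C channels.scm -- shell -m manifest.scm
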
[02:33.200 --> 02:37.600] We want to make sure we're going to use the latest vector instructions of our CPUs.
[02:37.600 --> 02:39.720] And that makes a lot of sense.
[02:39.720 --> 02:45.800] But sometimes, maybe, we have preconceptions about the implications of all this.
[02:45.800 --> 02:49.920] So here I'm quoting Todd Gamblin, who's maybe in this room, actually.
[02:49.920 --> 02:52.400] Hi, Todd, if you can see me.
[02:52.400 --> 02:57.120] This is an example where, well, Todd here was saying, you know, binary distributions
[02:57.120 --> 03:02.800] like Debian or Guix or Fedora, for example, are just targeting the baseline CPU,
[03:02.800 --> 03:06.920] like x86_64 without AVX, for example.
[03:06.920 --> 03:09.360] And that's a problem for performance.
[03:09.360 --> 03:13.040] Because, of course, if you have the latest fancy Intel processor, then you probably want
[03:13.040 --> 03:15.760] to use those vector instructions.
[03:15.760 --> 03:22.440] But the conclusion that, because of this, we cannot use, you know, binary distributions,
[03:22.440 --> 03:26.960] distributions like Guix or Debian that provide binaries, is not entirely accurate.
[03:26.960 --> 03:30.840] That's the point I'm trying to make in this talk.
[03:30.840 --> 03:35.760] So yeah, as most of you know, there's a whole bunch of vector extensions.
[03:35.760 --> 03:40.720] It keeps growing, you know, like every few years we have new vector extensions in
[03:40.720 --> 03:46.400] Intel or AMD CPUs, or even AArch64 CPUs, POWER9, et cetera.
[03:46.400 --> 03:52.800] And it's even worse if you look at the actual CPU models; for example, this is just for
[03:52.800 --> 03:55.360] Intel, there's a whole bunch of things.
[03:55.360 --> 04:01.440] It's not always a superset of the previous CPU, you know; we were discussing it the other
[04:01.440 --> 04:02.440] day at dinner.
[04:02.440 --> 04:04.400] And yeah, sometimes it's complicated.
[04:04.400 --> 04:09.280] You cannot say that Skylake-AVX512 is exactly a superset of Skylake.
[04:09.280 --> 04:10.800] It's complicated.
[04:10.800 --> 04:15.120] And yet you want to be able to target these CPUs specifically, these micro-architectures.
[04:15.120 --> 04:17.680] And it makes a big difference.
[04:17.680 --> 04:20.360] So this is an example from an Eigen benchmark.
[04:20.360 --> 04:27.520] So Eigen is a C++ library for linear algebra, specifically targeting small matrices.
[04:27.520 --> 04:32.840] And well, you know, on my laptop, if I'm compiling with -march=skylake,
[04:32.840 --> 04:39.400] then I get a throughput that's three times the baseline performance.
[04:39.400 --> 04:40.840] So it's a pretty big deal.
[04:40.840 --> 04:42.440] So we definitely want to use that.
[04:42.440 --> 04:49.080] We want to be able to compile specifically for the CPU micro-architecture that we have.
[04:49.080 --> 04:55.240] But the good news is that, to a large extent, that's been a solved problem for a long time.
[04:55.240 --> 05:02.120] So there is this thing called function multi-versioning that is already used in a number of performance-critical
[05:02.120 --> 05:03.120] libraries.
[05:03.120 --> 05:07.520] So if you look at the libc for string comparison, or if you look at OpenBLAS, if you look
[05:07.520 --> 05:15.080] at FFTW, GMP for multi-precision arithmetic, you know, many libraries, programming languages,
[05:15.080 --> 05:18.560] runtimes, already use function multi-versioning.
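
As a rough illustration of what function multi-versioning looks like at the source level (my own sketch, not something shown in the talk), GCC and Clang expose it through the target_clones attribute; how the runtime dispatch works is described next:

# Write a small C file whose hot function is cloned for several x86_64 ISA levels.
cat > dot.c <<'EOF'
__attribute__ ((target_clones ("default", "avx2", "avx512f")))
double dot (const double *x, const double *y, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += x[i] * y[i];
  return sum;
}
EOF

# Compile it and list the symbols: one clone per ISA plus a resolver that
# picks the right clone at load time (exact symbol names vary by compiler version).
gcc -O2 -c dot.c
nm dot.o | grep dot
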
[05:18.560 --> 05:19.960] So what's the deal here?
[05:19.960 --> 05:24.880] Well, roughly, when you have function multi-versioning, you can say, well, I have one function that
[05:24.880 --> 05:30.520] does some linear algebra stuff, for example, and I'm actually providing several variants
[05:30.520 --> 05:32.080] of that function.
[05:32.080 --> 05:37.080] And when I start my program, at runtime, the loader or, you know, the runtime system is
[05:37.080 --> 05:41.400] going to pick the most optimized one for the CPU I have at hand, right?
[05:41.400 --> 05:47.520] So if I use GMP, for example, for multi-precision arithmetic, it's going to pick the fastest
[05:47.520 --> 05:49.880] implementation it has, you know.
[05:49.880 --> 05:55.040] So you can compile GMP once, and then it's going to use the right one at runtime.
[05:55.040 --> 06:00.840] And if you're using GCC or Clang, you can specify in your C code, well, I want this
[06:00.840 --> 06:07.040] particular function to be cloned, to have several variants, one for each CPU micro-architecture,
[06:07.040 --> 06:11.680] and GCC or Clang is going to create several variants of that function so that it can pick
[06:11.680 --> 06:15.280] the right one at runtime.
[06:15.280 --> 06:21.280] So kind of a solved problem, in a way, well, except in some cases.
[06:21.280 --> 06:28.760] Well, one particular case where we have a problem is C++ template libraries, like Eigen,
[06:28.760 --> 06:34.520] which I was mentioning before; they are not able to benefit from function multi-versioning
[06:34.520 --> 06:35.520] in any way.
[06:35.520 --> 06:41.040] So when you compile your Eigen benchmark, well, you really have to use -march=skylake,
[06:41.040 --> 06:44.680] for example, if you are targeting a Skylake CPU.
[06:44.680 --> 06:49.400] And this is because, if you look at the Eigen headers, for example, there are many places where
[06:49.400 --> 06:54.760] you have ifdefs: do I have AVX-512 at compilation time?
[06:54.760 --> 06:58.440] If yes, then I'm going to use the optimized implementation; otherwise, I'm going to use
[06:58.440 --> 07:00.640] the baseline implementation.
[07:00.640 --> 07:05.400] And this is all happening at compilation time, so you really have to have a solution at compilation
[07:05.400 --> 07:08.360] time to address this.
[07:08.360 --> 07:11.800] And so this is where Guix comes in.
[07:11.800 --> 07:16.360] So Guix is, you know, a distribution, like Debian, like I was saying, that's targeting
[07:16.360 --> 07:22.360] the baseline instruction set, but we came up with a new thing that's called package multi-versioning.
[07:22.360 --> 07:27.840] It's actually one year old or something. The idea is roughly to take the same idea
[07:27.840 --> 07:34.320] as function multi-versioning, but to apply it at the level of entire packages.
[07:34.320 --> 07:41.080] So let's say I have those Eigen benchmarks; I can run them using just the baseline x86_64
[07:41.080 --> 07:44.160] architecture, using this guix shell command.
[07:44.160 --> 07:49.760] It's, you know, taking the eigen-benchmarks package, and in that package running the bench_gemm
[07:49.760 --> 07:55.120] command, right, on a small matrix.
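
The pair of invocations being described, the baseline run just mentioned and the tuned run the talk turns to next, look roughly like this; eigen-benchmarks is the package named in the talk, while bench_gemm is my best reconstruction of the benchmark program's name:

# Run the GEMM benchmark as built for the baseline x86_64 instruction set.
guix shell eigen-benchmarks -- bench_gemm

# Same run, but asking Guix to tune the package for the CPU we are running on.
guix shell --tune eigen-benchmarks -- bench_gemm
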
[07:55.120 --> 08:01.160] And then I can say, all right, now I want to tune that code specifically for my CPU,
[08:01.160 --> 08:08.240] and then I just add that extra --tune option, and it's telling Guix,
[08:08.240 --> 08:15.200] all right, please optimize that eigen-benchmarks package directly for the CPU I'm on, which
[08:15.200 --> 08:19.240] is Skylake in this case, and that's it.
[08:19.240 --> 08:23.560] And what happens behind the scenes is that, on the fly, Guix is creating a new package
[08:23.560 --> 08:24.560] variant.
[08:24.560 --> 08:29.800] So it's taking that eigen-benchmarks package, creating a new package variant that is built
[08:29.800 --> 08:36.960] specifically with a compiler wrapper that passes the -march=skylake flag.
[08:36.960 --> 08:43.600] And I get the performance, and I'm happy, right? So this has been in Guix since 2022, and
[08:43.600 --> 08:48.360] it's still reproducible, you know, because we can still say, all right, what precise option
[08:48.360 --> 08:53.120] did I use, what --tune option did I use, and it's Skylake, all right, so the
[08:53.120 --> 08:57.960] build process of the package remains reproducible, right? I'm still getting the same binary if
[08:57.960 --> 09:05.800] I use --tune=skylake on my laptop or on some HPC cluster, whatever.
[09:05.800 --> 09:10.000] And there are no world rebuilds, which means that the build farm, for example, the official
[09:10.000 --> 09:16.280] build farm of the project, is providing several variants of those packages, those, you know,
[09:16.280 --> 09:21.480] performance-sensitive packages, built for Skylake, Skylake-AVX512, you know, different things.
[09:21.480 --> 09:26.560] So if you install them, most likely you're going to get a pre-built binary that's specifically
[09:26.560 --> 09:28.640] optimized for that CPU.
[09:28.640 --> 09:34.600] And if not, well, that's fine, it's going to build it for you, that's okay.
[09:34.600 --> 09:42.200] So my conclusion here is, you know, we keep talking about performance, MPI, vector instructions,
[09:42.200 --> 09:43.200] and so forth.
[09:43.200 --> 09:47.680] Well, I think we can have performance, we can have portable performance, that's what
[09:47.680 --> 09:53.800] we should aim for, and we can still have reproducibility; we don't have to sacrifice reproducibility
[09:53.800 --> 09:57.120] for performance. That's my take-home message.
[09:57.120 --> 10:05.000] Thank you.
[10:05.000 --> 10:06.000] Thank you very much.
[10:06.000 --> 10:08.840] Again, time for one question.
[10:08.840 --> 10:14.640] Okay, yeah, this whole --tune thing looks awesome.
[10:14.640 --> 10:18.840] But what if the majority of the computation time is spent in libraries that that package
[10:18.840 --> 10:20.000] is actually using?
[10:20.000 --> 10:22.960] How do I tell it to optimize those as well?
[10:22.960 --> 10:23.960] Right.
[10:23.960 --> 10:29.520] So the way it works in Guix, you can annotate packages that really need to be tunable, right?
[10:29.520 --> 10:33.680] So you can add a property to a package, so it would be eigen-benchmarks in this
[10:33.680 --> 10:38.640] case, or it could be the GNU Scientific Library, GSL, and you say this package needs
[10:38.640 --> 10:43.920] to be tunable, so if I use --tune, please tune specifically this package.
[10:43.920 --> 10:48.040] And it's going to work even if you're installing, you know, an application that actually depends
[10:48.040 --> 10:50.840] on GSL, for example.
[10:50.840 --> 10:54.440] All right, thanks a lot, Ludovic.
[10:54.440 --> 10:55.440] Thank you.
[10:55.440 --> 11:21.440] Thank you.
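
On the reproducibility point made in the talk, --tune also accepts an explicit micro-architecture name, so the exact variant that was benchmarked can be stated and rebuilt elsewhere; a sketch, using the same package name as above:

# Request the Skylake-tuned variant by name; the same command on another
# machine rebuilds (or downloads) the same binary.
guix shell --tune=skylake eigen-benchmarks -- bench_gemm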