[00:00.000 --> 04:38.480] Alright, we're going to start the next talk. If you're going to stay in the room, please take a seat. If you want to leave, please leave. Okay, so the next speaker is Michele, who is going to talk about a universal Sparse BLAS library.

So, yes, at the core of many technical or scientific computing problems, we end up reducing the problem to solving a system of linear equations. If the system of linear equations were a simple one, like a two-by-two one, the method of solving it would be pretty simple. In any case, it would involve representing the linear system by the data structure of a matrix, so a table of symbols or, usually, numbers, plus a few vectors of numbers. So the matrix is the basic structure of scientific computing. In the case of such toy problems, or school problems, we have exact direct solutions at our disposal, which work fine. However, once we go into problems involving the simulation of larger domains, so engineering problems, the linear systems to be solved become large. Also, the methods that we use for smaller systems are not applicable here anymore, because the numerical stability we enjoy on, let's say, toy or small problems is not there anymore; simply put, with those methods the numbers, the results, diverge. And the time to solution also grows far too steeply. So they are simply infeasible and don't make sense, and completely different techniques are needed for large linear systems.

Furthermore, what if the systems are not only large, but the matrices are also full of zeros? What do we call such systems, what are we dealing with here? We are dealing with sparse systems, or sparse problems; this is what we call them. In this matrix coming from acoustics, we observe that, on average, less than half a percent of each row consists of non-zero elements. So we would call the system coming from this matrix a sparse system. Indeed, we are usually happy with the definition of Jim Wilkinson, where we say a problem, or a matrix, is sparse if, with our technology or our technique, we can turn the amount of zeros in it to our advantage. So this is the definition: it's not about numbers, it's really about what we are able to do with the way the matrix looks.

Among the different matrices we can encounter, we could have matrices from a circuit simulation, which look like this and have such clustered elements in them. Sometimes the elements are more clustered around the diagonal, like in this matrix, which I think comes from quantum chromodynamics. Computational fluid dynamics matrices are a bit more regular, I could say.
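As a minimal illustration of the storage side of "sparse" (an editorial sketch with made-up numbers, not from the talk's slides), keeping only the non-zeros, for instance in coordinate (COO) form, is what lets a library turn the zeros to its advantage at all:

    #include <stdio.h>

    /* A hypothetical 4x4 matrix with only 5 non-zero entries, kept as
     * COO triplets: row index, column index, value.                   */
    int main(void)
    {
        const int nr = 4, nc = 4, nnz = 5;
        const int    IA[] = { 0, 1, 2, 3, 3 };                /* row indices    */
        const int    JA[] = { 0, 1, 2, 0, 3 };                /* column indices */
        const double VA[] = { 1.0, 2.0, 3.0, 4.0, 5.0 };      /* values         */

        /* Dense storage needs nr*nc values regardless of content; COO needs
         * nnz values plus two index arrays. The saving grows with sparsity. */
        printf("dense: %d values; COO: %d values + 2*%d indices\n",
               nr * nc, nnz, nnz);
        printf("density: %.2f%%\n", 100.0 * nnz / (nr * nc));
        printf("first non-zero: A(%d,%d) = %g\n", IA[0], JA[0], VA[0]);
        return 0;
    }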
[04:38.480 --> 08:15.480] So, it means that you can perhaps exploit all of those matrices in different ways; this is what I'm showing you with this gallery. This is another CFD, so computational fluid dynamics, matrix, then a structural matrix, another matrix from a materials problem, a structural one, and so on. This one is also CFD. This was just to tell you that sparsity really is related to the technology we use to deal with it.

So, usually we are happy using iterative methods with sparse systems. Iterative methods, because something is being iterated: there is a loop, and with the most common methods, Krylov methods, the loop usually has a bottleneck, a core operation, which is predominantly multiplying the system matrix by one vector or by many vectors. It depends a bit on the technique; there are several of them. Here I'm showing a GNU Octave implementation of one such iterative method. So there are two kernel operations, or main operations: the multiplication of the sparse matrix by many vectors, let's say by another dense matrix, and the triangular solve, so solving with the matrices which are called preconditioner matrices but which are sparse. These are the core operations we are interested in.

I want to mention that those operations, the sparse matrix times vector or multi-vector operations, can have many variants. The variants can be on the matrix, which could perhaps be complex and Hermitian, or symmetric. It doesn't always have to be square; it could be rectangular. Perhaps it already has a unit diagonal, and we can exploit that. That is what I'm saying: many things change if the matrix holds complex numbers, or long complex numbers like the speaker before me spoke about. That changes the balance in the performance profile, and other things might change too. All of this has an influence on the specific kernels. And if you think, as Ludovic spoke about, of the different variants that one might want to build for different architectures, you see that you end up with code bloat if you really want to optimize each subcase. The operands also have their own variants, for instance in the way the data are laid out in the dense matrix. Similarly for the triangular solve operation: there, too, you have different variants, which lead to a multitude of different kernels, or ways you might wish to write them as code.

So this led a committee of people at the end of the 90s to meet, mostly US people but also with delegations from Europe, to standardize an API which they called the Sparse BLAS, sparse basic linear algebra subprograms: to somehow just give an API to the different variations of the operations that I spoke about. So it's not like the full BLAS, if you know the dense BLAS.
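To make the two core kernels mentioned above concrete, here is a naive sketch in plain C (an editorial illustration assuming the common CSR layout, not the Octave code from the slide, and without any of the optimizations or variants being discussed):

    /* y = A*x for an n x n matrix A stored in CSR form:
     * rowptr[i] .. rowptr[i+1] delimits row i's entries in col[] and val[]. */
    void csr_spmv(int n, const int *rowptr, const int *col, const double *val,
                  const double *x, double *y)
    {
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                s += val[k] * x[col[k]];
            y[i] = s;
        }
    }

    /* Solve L*x = b in place (x holds b on entry) for a lower triangular CSR
     * matrix whose last stored entry in each row is the diagonal element.    */
    void csr_lower_trsv(int n, const int *rowptr, const int *col, const double *val,
                        double *x)
    {
        for (int i = 0; i < n; ++i) {
            double s = x[i];
            int k = rowptr[i];
            for (; k < rowptr[i + 1] - 1; ++k)
                s -= val[k] * x[col[k]];
            x[i] = s / val[k];            /* divide by the diagonal entry */
        }
    }

Symmetry, complex arithmetic, transposition, multiple right-hand sides, operand layout and target architecture each fork new versions of these few lines, which is exactly the code bloat being described.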
[08:15.480 --> 11:41.360] The Sparse BLAS is mostly about creating sparse matrices, destroying them, and doing a few operations; not only the ones I showed, but those are really the core operations. And they specified C and Fortran bindings, because, yeah, the document they finalized is twenty-something years old. Now, after 20 years, we could say that what they wrote (and this is especially my opinion) is perfectly portable and allows some parallelization, even if parallelization is not specified at all. They didn't foresee extensions, but extensions are possible: if you look at the API, you see that you can have them, so nothing is blocked there.

The name of the Sparse BLAS has been copied by every major vendor you can imagine. The sad thing is that each major vendor has also violated the API: each changed something in a slightly incompatible way, which is simply sad. And the original Sparse BLAS didn't think about GPUs. But actually, in my experience, looking at how people program, I see so much technical debt that I think you can make compromises, and with small adaptations you could adapt the Sparse BLAS to GPU computation, to some extent. So I think you can save this API to a good extent, and this is the reason why I'm here: I wrote a library which respects the original Sparse BLAS.

A Sparse BLAS program can look like this: you have a notation for the Sparse BLAS operations which is logical if you know the BLAS a bit, so you can understand it. Going in the direction of my library: it is centered around a data format, a sparse matrix format, which I came up with. It's called recursive sparse blocks, because there is a recursive subdivision and there are blocks which are sparse. The motivation for this data structure was to not exclude the Sparse BLAS operations: I made compromises in order to allow the Sparse BLAS operations to be there; I didn't want to preclude them. So it's a compromise.

The core idea is to have, let's say, cache-sized blocks, more or less, and a way to give each core of a multicore machine something to work with. So it's oriented towards multicore; it's not for GPUs, at least not at the moment. The matrices you have seen before look a bit like this in this data format. The color is based on the population, that is, on how many non-zeros each block holds; then there is another color coding with other information. But this is just to tell you that the irregular aspect of those matrices is reflected here too, to some extent. So, the library itself wants to provide the Sparse BLAS, so building blocks for iterative solvers, and it's a pretty compatible library.
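For readers without the slide, the Sparse BLAS program mentioned above looks roughly like the following sketch in the standard's C binding. The BLAS_dus* names come from the Sparse BLAS standard's C interface, shipped for instance in librsb's blas_sparse.h; the toy data are made up, and exact signatures should be checked against that header rather than taken from here:

    #include <blas_sparse.h>   /* Sparse BLAS C binding */
    #include <stdio.h>

    int main(void)
    {
        /* A toy 2x2 matrix given as COO triplets (illustrative values). */
        const int    IA[] = { 0, 1, 1 };
        const int    JA[] = { 0, 0, 1 };
        const double VA[] = { 2.0, 1.0, 2.0 };
        const double x[]  = { 1.0, 1.0 };
        double       y[]  = { 0.0, 0.0 };

        blas_sparse_matrix A = BLAS_duscr_begin(2, 2);   /* create handle    */
        BLAS_duscr_insert_entries(A, 3, VA, IA, JA);     /* insert non-zeros */
        BLAS_duscr_end(A);                               /* assemble         */

        BLAS_dusmv(blas_no_trans, 1.0, A, x, 1, y, 1);   /* y := y + A*x     */

        BLAS_usds(A);                                    /* destroy          */
        printf("y = [ %g %g ]\n", y[0], y[1]);
        return 0;
    }

Creation, a few operations such as usmv, and destruction cover most of the surface; as mentioned next, running this on top of librsb only adds one init and one finalize call.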
[11:41.360 --> 15:10.480] The library works with C++, Fortran, Octave, and Python. I say it's quite compatible: it uses, let's say, established technologies, and it's quite compatible also with your software, in the sense that the only custom data structure is the matrix; you don't need extra data structures for vectors. And the program you saw before, written against the Sparse BLAS, when using RSB underneath, needs just one extra init and one finalize function. So I really do respect that API.

However, it's also nice to write the 15th standard. I'm joking: this is not a 15th standard, it's just the native API. So, if you want, you can also exploit the native API of librsb (not an internal one, but the native one). Please tell me when I'm at 10 minutes. That API is primarily in C; then you have C++ wrappers, and there is also a Fortran one. These are the native APIs.

What is specific about RSB is that it is not so clear which blocking is best: depending on how you block, you can get better or worse performance. For this reason there is the idea of automated empirical optimization in this library. There is a special function which you call when you want to invest time asking the library to optimize the data structure a bit. So you sacrifice a minute, perhaps, optimizing the data structure, in the hope that the many hours in which you will be using this matrix afterwards will profit, will be reduced, thanks to the optimization. Because, as I said, this is meant to be used with iterative methods, so you will be running them for many hours, and therefore spending a few minutes on automated optimization is something that should pay off. No guarantee, but that's the idea, and that is usually how it goes.

To give an idea, the C++ API is what you would expect. There is a class templated on the numerical type, so there is type safety when you say: this is my matrix, these are my non-zeros, because that is what we are representing here. You have C-style flags for options like symmetry, or for asking to discard zeros rather than keep them (sometimes you want to keep structural zeros, for instance in order to modify them later). There are many such options; this is why I'm showing this slide, to tell you that there are many options which I'm not showing here. And the only custom data structure is the RSB matrix, no other custom stuff. You can also enjoy the span interface of C++20, which doesn't force you to have any weird custom vector type apart from the standard C++ ones.
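As a sketch of that native C API, here is roughly what librsb's own minimal examples look like. The calls named here (rsb_lib_init, rsb_mtx_alloc_from_coo_const, rsb_spmv, rsb_tune_spmm, rsb_mtx_free, rsb_lib_exit) do exist in librsb, but the exact signatures, constants and error handling below should be verified against rsb.h and the librsb documentation rather than taken as authoritative:

    #include <rsb.h>    /* librsb native C API */
    #include <stdio.h>

    int main(void)
    {
        /* Library init / finalize bracket the whole program. */
        if (rsb_lib_init(RSB_NULL_INIT_OPTIONS) != RSB_ERR_NO_ERROR)
            return 1;

        /* Build an RSB matrix from COO triplets (same toy 2x2 data as before). */
        rsb_err_t errval = RSB_ERR_NO_ERROR;
        const rsb_nnz_idx_t nnz = 3;
        const rsb_coo_idx_t IA[] = { 0, 1, 1 }, JA[] = { 0, 0, 1 };
        const double VA[] = { 2.0, 1.0, 2.0 }, x[] = { 1.0, 1.0 };
        double y[] = { 0.0, 0.0 };
        const double alpha = 1.0, beta = 0.0;

        struct rsb_mtx_t *mtxAp = rsb_mtx_alloc_from_coo_const(
            VA, IA, JA, nnz, RSB_NUMERICAL_TYPE_DOUBLE,
            2, 2,               /* rows, columns             */
            0, 0,               /* blocking hints (defaults) */
            RSB_FLAG_NOFLAGS, &errval);
        if (!mtxAp || errval != RSB_ERR_NO_ERROR)
            return 1;

        /* y := alpha * A * x + beta * y, the SpMV kernel discussed earlier.
         * The auto-tuning routine mentioned in the talk (rsb_tune_spmm())
         * could be invoked before a long iterative run; its fuller
         * signature is omitted here.                                      */
        rsb_spmv(RSB_TRANSPOSITION_N, &alpha, mtxAp, x, 1, &beta, y, 1);
        printf("y = [ %g %g ]\n", y[0], y[1]);

        rsb_mtx_free(mtxAp);
        rsb_lib_exit(RSB_NULL_EXIT_OPTIONS);
        return 0;
    }

Note how x and y are plain C arrays: the only custom data structure in sight is the rsb_mtx_t matrix, which matches the compatibility point made above.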
[15:10.480 --> 18:24.680] If you want to use, for instance, GNU Octave and enjoy the multicore speedup from librsb, you can use the sparsersb plugin which I wrote, which uses librsb's C++ interface pretty efficiently. So, apart from a few conversions, it should have almost native performance. Similarly for Python: the PyRSB standalone package has an interface which is copied from SciPy's csr_matrix, so you use it mostly the same way, but underneath librsb runs. You don't see it, or you see it if you ask it to use the auto-tuning routine, because, as I said, in all of those language bindings you can also use all of the functionality of librsb, which includes the auto-tuning, also here in Octave. And I want to stress this: GNU Octave doesn't have multi-threaded sparse operations; with librsb, you can have them. Same for SciPy sparse: as far as I know it's not multi-threaded; with librsb, you get that.

librsb is by default licensed under the Lesser GPL version 3, which means that, as long as you don't modify it, you can distribute it with your proprietary code. If you modify it, well, it's more complicated: you have to release the modified version.

If you want to learn to use the librsb library, it makes absolute sense to use a packaged version from Debian, Ubuntu, or most other Linux distributions, or, if you use Windows, you can use Cygwin. Once you want the performance, you can just compile it yourself, because it's quite trivial, or enjoy what our colleagues here from Spack and EasyBuild have done and use the packages they provide. And some people have written wrappers for Rust and Julia. I don't know these languages, so I didn't use them; I think the Rust one covers the entire API, while the Julia one is more in Julia style, so just the core functionality is there, I think.

Yeah, that was everything. I don't know how much time I took. Oh, 15 minutes. So, thanks.