[00:00.000 --> 04:38.480] Alright, we're going to start the next talk. If you're going to stay in the room, please take a seat. If you want to leave, please leave. Okay, so the next speaker is Michele, who is going to talk about a universal Sparse BLAS library.

So, yes, at the core of many technical or scientific computing problems, we end up reducing the problem to solving a system of linear equations. If the system of linear equations were a simple one, like a two-by-two one, the method of solving it would be pretty simple. In any case, it would involve representing the linear system by the data structure of a matrix, so a table of symbols or, usually, numbers, plus a few vectors of numbers. So the matrix is the basic structure of scientific computing. In the case of such toy problems, or school problems, we have exact direct solutions at our disposal, which work fine. However, once we go into problems involving the simulation of larger domains, so engineering problems, the linear systems to be solved become large. Also, the methods that we use for smaller systems are not applicable here anymore, because the numerical stability we enjoy on, let's say, toy or small problems is not there anymore; simply put, with those methods the numbers, the results, diverge. And the time to solution also grows far too steeply. So they are simply infeasible and don't make sense, and completely different techniques are needed for large linear systems.

Furthermore, what if the systems are not only large, but the matrices are also full of zeros? What do we call such systems, what are we dealing with here? We are dealing with sparse systems, or sparse problems; this is what we call them. In this matrix coming from acoustics, we observe that, on average, less than half a percent of each row consists of non-zero elements. So we would call the system coming from this matrix a sparse system. Indeed, we are usually happy with the definition of Jim Wilkinson, where we say a problem, or a matrix, is sparse if, with our technology or our technique, we can turn the amount of zeros in it to our advantage. So this is the definition: it's not about numbers, it's really about what we are able to do with the way the matrix looks.

Among the different matrices we can encounter, we could have matrices from a circuit simulation, which look like this and have such clustered elements in them. Sometimes the elements are more clustered around the diagonal, like in this matrix, which I think comes from quantum chromodynamics. Computational fluid dynamics matrices are a bit more regular, I could say.
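As a minimal illustration of the storage side of "sparse" (an editorial sketch with made-up numbers, not from the talk's slides), keeping only the non-zeros, for instance in coordinate (COO) form, is what lets a library turn the zeros to its advantage at all:

    #include <stdio.h>

    /* A hypothetical 4x4 matrix with only 5 non-zero entries, kept as
     * COO triplets: row index, column index, value.                   */
    int main(void)
    {
        const int nr = 4, nc = 4, nnz = 5;
        const int    IA[] = { 0, 1, 2, 3, 3 };                /* row indices    */
        const int    JA[] = { 0, 1, 2, 0, 3 };                /* column indices */
        const double VA[] = { 1.0, 2.0, 3.0, 4.0, 5.0 };      /* values         */

        /* Dense storage needs nr*nc values regardless of content; COO needs
         * nnz values plus two index arrays. The saving grows with sparsity. */
        printf("dense: %d values; COO: %d values + 2*%d indices\n",
               nr * nc, nnz, nnz);
        printf("density: %.2f%%\n", 100.0 * nnz / (nr * nc));
        printf("first non-zero: A(%d,%d) = %g\n", IA[0], JA[0], VA[0]);
        return 0;
    }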
[04:38.480 --> 08:15.480] So, it means that you can perhaps exploit all of those matrices in different ways; this is what I'm showing you with this gallery. This is another CFD, so computational fluid dynamics, matrix, then a structural matrix, another matrix from a materials problem, a structural one, and so on. This one is also CFD. This was just to tell you that sparsity really is related to the technology we use to deal with it.

So, usually we are happy using iterative methods with sparse systems. Iterative methods, because something is being iterated: there is a loop, and with the most common methods, Krylov methods, the loop usually has a bottleneck, a core operation, which is predominantly multiplying the system matrix by one vector or by many vectors. It depends a bit on the technique; there are several of them. Here I'm showing a GNU Octave implementation of one such iterative method. So there are two kernel operations, or main operations: the multiplication of the sparse matrix by many vectors, let's say by another dense matrix, and the triangular solve, so solving with the matrices which are called preconditioner matrices but which are sparse. These are the core operations we are interested in.

I want to mention that those operations, the sparse matrix times vector or multi-vector operations, can have many variants. The variants can be on the matrix, which could perhaps be complex and Hermitian, or symmetric. It doesn't always have to be square; it could be rectangular. Perhaps it already has a unit diagonal, and we can exploit that. That is what I'm saying: many things change if the matrix holds complex numbers, or long complex numbers like the speaker before me spoke about. That changes the balance in the performance profile, and other things might change too. All of this has an influence on the specific kernels. And if you think, as Ludovic spoke about, of the different variants that one might want to build for different architectures, you see that you end up with code bloat if you really want to optimize each subcase. The operands also have their own variants, for instance in the way the data are laid out in the dense matrix. Similarly for the triangular solve operation: there, too, you have different variants, which lead to a multitude of different kernels, or ways you might wish to write them as code.

So this led a committee of people at the end of the 90s to meet, mostly US people but also with delegations from Europe, to standardize an API which they called the Sparse BLAS, sparse basic linear algebra subprograms: to somehow just give an API to the different variations of the operations that I spoke about. So it's not like the full BLAS, if you know the dense BLAS.
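To make the two core kernels mentioned above concrete, here is a naive sketch in plain C (an editorial illustration assuming the common CSR layout, not the Octave code from the slide, and without any of the optimizations or variants being discussed):

    /* y = A*x for an n x n matrix A stored in CSR form:
     * rowptr[i] .. rowptr[i+1] delimits row i's entries in col[] and val[]. */
    void csr_spmv(int n, const int *rowptr, const int *col, const double *val,
                  const double *x, double *y)
    {
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                s += val[k] * x[col[k]];
            y[i] = s;
        }
    }

    /* Solve L*x = b in place (x holds b on entry) for a lower triangular CSR
     * matrix whose last stored entry in each row is the diagonal element.    */
    void csr_lower_trsv(int n, const int *rowptr, const int *col, const double *val,
                        double *x)
    {
        for (int i = 0; i < n; ++i) {
            double s = x[i];
            int k = rowptr[i];
            for (; k < rowptr[i + 1] - 1; ++k)
                s -= val[k] * x[col[k]];
            x[i] = s / val[k];            /* divide by the diagonal entry */
        }
    }

Symmetry, complex arithmetic, transposition, multiple right-hand sides, operand layout and target architecture each fork new versions of these few lines, which is exactly the code bloat being described.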
[08:15.480 --> 11:41.360] The Sparse BLAS is mostly about creating sparse matrices, destroying them, and doing a few operations; not only the ones I showed, but those are really the core operations. And they specified C and Fortran bindings, because, yeah, the document they finalized is twenty-something years old. Now, after 20 years, we could say that what they wrote (and this is especially my opinion) is perfectly portable and allows some parallelization, even if parallelization is not specified at all. They didn't foresee extensions, but extensions are possible: if you look at the API, you see that you can have them, so nothing is blocked there.

The name of the Sparse BLAS has been copied by every major vendor you can imagine. The sad thing is that each major vendor has also violated the API: each changed something in a slightly incompatible way, which is simply sad. And the original Sparse BLAS didn't think about GPUs. But actually, in my experience, looking at how people program, I see so much technical debt that I think you can make compromises, and with small adaptations you could adapt the Sparse BLAS to GPU computation, to some extent. So I think you can save this API to a good extent, and this is the reason why I'm here: I wrote a library which respects the original Sparse BLAS.

A Sparse BLAS program can look like this: you have a notation for the Sparse BLAS operations which is logical if you know the BLAS a bit, so you can understand it. Going in the direction of my library: it is centered around a data format, a sparse matrix format, which I came up with. It's called recursive sparse blocks, because there is a recursive subdivision and there are blocks which are sparse. The motivation for this data structure was to not exclude the Sparse BLAS operations: I made compromises in order to allow the Sparse BLAS operations to be there; I didn't want to preclude them. So it's a compromise.

The core idea is to have, let's say, cache-sized blocks, more or less, and a way to give each core of a multicore machine something to work with. So it's oriented towards multicore; it's not for GPUs, at least not at the moment. The matrices you have seen before look a bit like this in this data format. The color is based on the population, that is, on how many non-zeros each block holds; then there is another color coding with other information. But this is just to tell you that the irregular aspect of those matrices is reflected here too, to some extent. So, the library itself wants to provide the Sparse BLAS, so building blocks for iterative solvers, and it's a pretty compatible library.
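For readers without the slide, the Sparse BLAS program mentioned above looks roughly like the following sketch in the standard's C binding. The BLAS_dus* names come from the Sparse BLAS standard's C interface, shipped for instance in librsb's blas_sparse.h; the toy data are made up, and exact signatures should be checked against that header rather than taken from here:

    #include <blas_sparse.h>   /* Sparse BLAS C binding */
    #include <stdio.h>

    int main(void)
    {
        /* A toy 2x2 matrix given as COO triplets (illustrative values). */
        const int    IA[] = { 0, 1, 1 };
        const int    JA[] = { 0, 0, 1 };
        const double VA[] = { 2.0, 1.0, 2.0 };
        const double x[]  = { 1.0, 1.0 };
        double       y[]  = { 0.0, 0.0 };

        blas_sparse_matrix A = BLAS_duscr_begin(2, 2);   /* create handle    */
        BLAS_duscr_insert_entries(A, 3, VA, IA, JA);     /* insert non-zeros */
        BLAS_duscr_end(A);                               /* assemble         */

        BLAS_dusmv(blas_no_trans, 1.0, A, x, 1, y, 1);   /* y := y + A*x     */

        BLAS_usds(A);                                    /* destroy          */
        printf("y = [ %g %g ]\n", y[0], y[1]);
        return 0;
    }

Creation, a few operations such as usmv, and destruction cover most of the surface; as mentioned next, running this on top of librsb only adds one init and one finalize call.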
[11:41.360 --> 15:10.480] The library works with C++, Fortran, Octave, and Python. I say it's quite compatible: it uses, let's say, established technologies, and it's quite compatible also with your software, in the sense that the only custom data structure is the matrix; you don't need extra data structures for vectors. And the program you saw before, written against the Sparse BLAS, when using RSB underneath, needs just one extra init and one finalize function. So I really do respect that API.

However, it's also nice to write the 15th standard. I'm joking: this is not a 15th standard, it's just the native API. So, if you want, you can also exploit the native API of librsb (not an internal one, but the native one). Please tell me when I'm at 10 minutes. That API is primarily in C; then you have C++ wrappers, and there is also a Fortran one. These are the native APIs.

What is specific about RSB is that it is not so clear which blocking is best: depending on how you block, you can get better or worse performance. For this reason there is the idea of automated empirical optimization in this library. There is a special function which you call when you want to invest time asking the library to optimize the data structure a bit. So you sacrifice a minute, perhaps, optimizing the data structure, in the hope that the many hours in which you will be using this matrix afterwards will profit, will be reduced, thanks to the optimization. Because, as I said, this is meant to be used with iterative methods, so you will be running them for many hours, and therefore spending a few minutes on automated optimization is something that should pay off. No guarantee, but that's the idea, and that is usually how it goes.

To give an idea, the C++ API is what you would expect. There is a class templated on the numerical type, so there is type safety when you say: this is my matrix, these are my non-zeros, because that is what we are representing here. You have C-style flags for options like symmetry, or for asking to discard zeros rather than keep them (sometimes you want to keep structural zeros, for instance in order to modify them later). There are many such options; this is why I'm showing this slide, to tell you that there are many options which I'm not showing here. And the only custom data structure is the RSB matrix, no other custom stuff. You can also enjoy the span interface of C++20, which doesn't force you to have any weird custom vector type apart from the standard C++ ones.
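As a sketch of that native C API, here is roughly what librsb's own minimal examples look like. The calls named here (rsb_lib_init, rsb_mtx_alloc_from_coo_const, rsb_spmv, rsb_tune_spmm, rsb_mtx_free, rsb_lib_exit) do exist in librsb, but the exact signatures, constants and error handling below should be verified against rsb.h and the librsb documentation rather than taken as authoritative:

    #include <rsb.h>    /* librsb native C API */
    #include <stdio.h>

    int main(void)
    {
        /* Library init / finalize bracket the whole program. */
        if (rsb_lib_init(RSB_NULL_INIT_OPTIONS) != RSB_ERR_NO_ERROR)
            return 1;

        /* Build an RSB matrix from COO triplets (same toy 2x2 data as before). */
        rsb_err_t errval = RSB_ERR_NO_ERROR;
        const rsb_nnz_idx_t nnz = 3;
        const rsb_coo_idx_t IA[] = { 0, 1, 1 }, JA[] = { 0, 0, 1 };
        const double VA[] = { 2.0, 1.0, 2.0 }, x[] = { 1.0, 1.0 };
        double y[] = { 0.0, 0.0 };
        const double alpha = 1.0, beta = 0.0;

        struct rsb_mtx_t *mtxAp = rsb_mtx_alloc_from_coo_const(
            VA, IA, JA, nnz, RSB_NUMERICAL_TYPE_DOUBLE,
            2, 2,               /* rows, columns             */
            0, 0,               /* blocking hints (defaults) */
            RSB_FLAG_NOFLAGS, &errval);
        if (!mtxAp || errval != RSB_ERR_NO_ERROR)
            return 1;

        /* y := alpha * A * x + beta * y, the SpMV kernel discussed earlier.
         * The auto-tuning routine mentioned in the talk (rsb_tune_spmm())
         * could be invoked before a long iterative run; its fuller
         * signature is omitted here.                                      */
        rsb_spmv(RSB_TRANSPOSITION_N, &alpha, mtxAp, x, 1, &beta, y, 1);
        printf("y = [ %g %g ]\n", y[0], y[1]);

        rsb_mtx_free(mtxAp);
        rsb_lib_exit(RSB_NULL_EXIT_OPTIONS);
        return 0;
    }

Note how x and y are plain C arrays: the only custom data structure in sight is the rsb_mtx_t matrix, which matches the compatibility point made above.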
[15:10.480 --> 18:24.680] If you want to use, for instance, GNU Octave and enjoy the multicore speedup from librsb, you can use the sparsersb plugin which I wrote, which uses librsb's C++ interface pretty efficiently. So, apart from a few conversions, it should have almost native performance. Similarly for Python: the PyRSB standalone package has an interface which is copied from SciPy's csr_matrix, so you use it mostly the same way, but underneath librsb runs. You don't see it, or you see it if you ask it to use the auto-tuning routine, because, as I said, in all of those language bindings you can also use all of the functionality of librsb, which includes the auto-tuning, also here in Octave. And I want to stress this: GNU Octave doesn't have multi-threaded sparse operations; with librsb, you can have them. Same for SciPy sparse: as far as I know it's not multi-threaded; with librsb, you get that.

librsb is by default licensed under the Lesser GPL version 3, which means that, as long as you don't modify it, you can distribute it with your proprietary code. If you modify it, well, it's more complicated: you have to release the modified version.

If you want to learn to use the librsb library, it makes absolute sense to use a packaged version from Debian, Ubuntu, or most other Linux distributions, or, if you use Windows, you can use Cygwin. Once you want the performance, you can just compile it yourself, because it's quite trivial, or enjoy what our colleagues here from Spack and EasyBuild have done and use the packages they provide. And some people have written wrappers for Rust and Julia. I don't know these languages, so I didn't use them; I think the Rust one covers the entire API, while the Julia one is more in Julia style, so just the core functionality is there, I think.

Yeah, that was everything. I don't know how much time I took. Oh, 15 minutes. So, thanks.