[00:00.000 --> 00:18.920] Next lightning talk is Jan Verschijl talking about multiple double arithmetic on GPUs. [00:18.920 --> 00:25.400] Thank you very much, the organizers for allowing me to speak here. [00:25.400 --> 00:36.120] So I will hope to talk about computations that I've been doing with multiple doubles. [00:36.120 --> 00:43.360] So the multiple doubles go back actually from the time when people, when the hardware was [00:43.360 --> 00:45.840] not yet supporting doubles. [00:45.840 --> 00:49.080] So this was the late 60s. [00:49.080 --> 00:51.720] So this is actually a similar idea. [00:51.720 --> 00:58.680] So you use the hardware arithmetic to extend your precision. [00:58.680 --> 01:00.600] It has a lot of advantages. [01:00.600 --> 01:07.320] So if you're used to working with complex arithmetic, then double-double arithmetic has about the [01:07.320 --> 01:10.400] same intensity. [01:10.400 --> 01:17.920] So speaking of intensities, relative to the previous talk where we were working with graphs, [01:17.920 --> 01:23.080] so in the previous talk I had the impression that everything was about graphs and there [01:23.080 --> 01:24.760] was a memory bound. [01:24.760 --> 01:27.440] My problems are compute bound. [01:27.440 --> 01:33.200] So I get really good arithmetic intensities. [01:33.200 --> 01:35.000] There are some disadvantages, of course. [01:35.000 --> 01:40.320] If you want to work with, say, 17 decimal places, you can't. [01:40.320 --> 01:46.040] Also if you want to work with truly infinite decimals, well, you can't either because [01:46.040 --> 01:52.600] you're still having your 11 bits of the exponent. [01:52.600 --> 01:57.240] Disadvantage might also be that you can still do numerical analysis. [01:57.240 --> 02:02.440] So this might be an advantage or disadvantage. [02:02.440 --> 02:05.960] I got into this by power series arithmetic. [02:05.960 --> 02:09.760] So this is about the EXP and the EPS. [02:09.760 --> 02:15.240] So when I started working with power series, I was using 11111111. [02:15.240 --> 02:17.120] And I know the binomial theorem. [02:17.120 --> 02:22.080] Well, I only knew it when I saw the numbers blowing up on me. [02:22.080 --> 02:24.800] So you know it when you don't know it. [02:24.800 --> 02:26.480] So here is a table. [02:26.480 --> 02:31.880] The exponential has a very nice development, nicely decaying. [02:31.880 --> 02:36.120] And if you multiply these exponentials, you don't have any blow-up. [02:36.120 --> 02:43.320] However, the last coefficient, if you want to represent that, and you have to think about [02:43.320 --> 02:44.320] GPUs. [02:44.320 --> 02:48.640] GPUs are actually quite happy if they can do things in groups of 32. [02:48.640 --> 02:56.440] So actually a 32 power series, an order 32 power series, is actually still very small [02:56.440 --> 02:57.440] for GPUs. [02:57.440 --> 03:00.040] But there you have already to use quad doubles. [03:00.040 --> 03:07.920] Otherwise, your last coefficients, you can't represent it anymore. [03:07.920 --> 03:09.920] OK. [03:09.920 --> 03:13.920] So I started working with the QD library. [03:13.920 --> 03:18.000] And then we were doing multi-core. [03:18.000 --> 03:24.680] Me and my student, Gennady Jofi, and we looked at each other, should we do this on the GPU? [03:24.680 --> 03:28.000] Should we write the entire library on the GPU? [03:28.000 --> 03:31.160] My student didn't really want to do it, and I didn't want to do it. [03:31.160 --> 03:35.680] But then we discovered GQD, and we used GQD. [03:35.680 --> 03:39.800] And the recent package that we are using is Compari. [03:39.800 --> 03:43.880] It's actually the only software I know that is named after a beverage. [03:43.880 --> 03:46.560] I don't know if that's a good sign or not. [03:46.560 --> 03:52.040] In my supermarket store in Chicago, I once saw Compari, but it's not my drink. [03:52.040 --> 03:58.040] So I didn't want to ruin the taste of using Compari. [03:58.040 --> 03:59.440] So I stayed off this. [03:59.440 --> 04:01.080] Compari is actually quite good. [04:01.080 --> 04:08.920] So because it allowed me to go to quad double, and now also octo double. [04:08.920 --> 04:15.840] The numbers in this table are kind of good, because I want to have really performance. [04:15.840 --> 04:22.040] But it also comes somehow misleading, because as soon as you're using complex double-double, [04:22.040 --> 04:24.320] everything becomes compute bound. [04:24.320 --> 04:30.320] And the problems that you have with memory transfer and all, you do a lot of arithmetic [04:30.320 --> 04:37.760] operations on a relatively small amount of data. [04:37.760 --> 04:40.360] I also like to do quality up. [04:40.360 --> 04:48.320] If you can afford the time for, say, a double precision calculation, well, you will see [04:48.320 --> 04:50.880] that everything is not really right. [04:50.880 --> 04:59.760] But then you can allow the same amount of time, and you quadruple the precision. [04:59.760 --> 05:09.600] So the 439 there, think about 1 gigaflop, 2 gigaflop, and then you go to teraflop. [05:09.600 --> 05:15.040] So the 439 is kind of, if you have teraflop performance, it's like as if you would be [05:15.040 --> 05:19.140] doing this on a single core. [05:19.140 --> 05:23.000] So I mentioned the funding agencies at the very slight. [05:23.000 --> 05:26.080] I would like to have a hopper. [05:26.080 --> 05:31.000] But so for now, I have to deal with Pascal and Volta. [05:31.000 --> 05:38.720] And the last one is a gaming laptop, which is also actually quite a powerful GPU. [05:38.720 --> 05:42.040] My first teraflop card was Kepler. [05:42.040 --> 05:46.960] And this last list of GPU actually gets there. [05:46.960 --> 05:49.960] Okay. [05:49.960 --> 05:54.480] If you think of a double-double, there is a double-two. [05:54.480 --> 05:57.480] And then for a quad-double, there is the double-four. [05:57.480 --> 06:00.520] So that was what the GQD was using. [06:00.520 --> 06:03.160] And that's very good for memory coalescing. [06:03.160 --> 06:09.040] But we actually got into trouble with the complex quad-double because there was no double-eight. [06:09.040 --> 06:15.680] So instead of working, if you work with a vector of quad-doubles, a vector of arrays [06:15.680 --> 06:20.360] of four length, you actually better use four vectors. [06:20.360 --> 06:25.800] The first one with the highest double, second double, third double, fourth double. [06:25.800 --> 06:29.160] So it's a little bit similar like working with power series. [06:29.160 --> 06:33.280] So power series is invertible if the leading coefficient is not zero. [06:33.280 --> 06:36.200] You can work with matrices of power series. [06:36.200 --> 06:37.280] But actually, that's not good. [06:37.280 --> 06:41.640] You should actually work with a series where the coefficients are matrices. [06:41.640 --> 06:43.360] Same idea here. [06:43.360 --> 06:47.760] QDLIP is a very good library still. [06:47.760 --> 06:48.940] It's quite complete. [06:48.940 --> 06:55.200] So I have extended the square root, for example, to octodouble precision. [06:55.200 --> 06:56.480] OK. [06:56.480 --> 07:01.840] So here is then my beginning. [07:01.840 --> 07:04.880] So I mentioned, so you saw this eight. [07:04.880 --> 07:12.400] So if you take a vector of random complex numbers, 64, then the norm is eight. [07:12.400 --> 07:13.400] Should be eight. [07:13.400 --> 07:16.480] So that's a really nice test property. [07:16.480 --> 07:22.200] If you work with GPUs, you actually define kernels, and kernels, the name says it itself, [07:22.200 --> 07:23.200] it should be small. [07:23.200 --> 07:24.800] So think small. [07:24.800 --> 07:29.120] And actually, this problem is a small problem, mathematically speaking, but it has all the [07:29.120 --> 07:35.480] richness and the complexity of all the problems that you will run into. [07:35.480 --> 07:39.200] You will have to study the prefix sum algorithm, for example. [07:39.200 --> 07:40.560] So that is needed. [07:40.560 --> 07:46.600] You also have to tune your software for large vectors or for small vectors. [07:46.600 --> 07:53.440] You can only have one block of threads that is active. [07:53.440 --> 07:56.680] The square root works with staggered. [07:56.680 --> 08:01.880] So you apply a Newton method. [08:01.880 --> 08:07.320] And then actually, this is where the dot comes in. [08:07.320 --> 08:15.600] So the nice thing about double doubles, quad doubles, is that everything fits into registers. [08:15.600 --> 08:19.120] So it's also very good if you do multi-core. [08:19.120 --> 08:21.120] So you don't have to use the heap ever. [08:21.120 --> 08:25.560] But of course, when you get to complex quad doubles, you have these eight arrays. [08:25.560 --> 08:29.160] If you do octodoubles, so it doubles and doubles and doubles. [08:29.160 --> 08:34.840] So I have with my old graphics cards, they can no longer even compile the octodoubles [08:34.840 --> 08:36.880] if you inline too much. [08:36.880 --> 08:45.960] So it's still very interesting that, actually, you have to tailor your kernels towards the [08:45.960 --> 08:48.760] precision levels. [08:48.760 --> 08:50.000] So here is my last slide. [08:50.000 --> 08:52.800] I did more than just norms. [08:52.800 --> 09:00.320] So we have teraflop performance when we evaluate polynomials and differentiate them. [09:00.320 --> 09:04.080] The QR, the blocked householder QR, is also wonderful. [09:04.080 --> 09:12.280] You get already teraflop performance with 1,000 in complex double-double. [09:12.280 --> 09:17.520] And then the last paper is where you try to combine these things by computing Taylor series [09:17.520 --> 09:24.720] for solutions of solution developments for polynomial systems. [09:24.720 --> 09:27.240] Newton's method is actually a quite nice operator. [09:27.240 --> 09:32.680] You start with a multivariate system where all the variables are linked to each other. [09:32.680 --> 09:38.880] And what Newton actually does, it spits out power series for each component. [09:38.880 --> 09:46.240] So actually, it untangles all the linearities, all the nonlinearities. [09:46.240 --> 09:47.640] So I listed the archive. [09:47.640 --> 09:52.160] So the IEEE puts things in a paywall, behind the paywall. [09:52.160 --> 09:55.240] So you have the archive versions there. [09:55.240 --> 09:59.000] And you're more than welcome to the bottom line of this slide. [09:59.000 --> 10:02.960] I mean, the conclusion, actually, is that all the software is free and open source. [10:02.960 --> 10:32.040] I'd have the GitHub handle there.