[00:00.000 --> 00:18.920]  Next lightning talk is Jan Verschijl talking about multiple double arithmetic on GPUs.
[00:18.920 --> 00:25.400]  Thank you very much, the organizers for allowing me to speak here.
[00:25.400 --> 00:36.120]  So I will hope to talk about computations that I've been doing with multiple doubles.
[00:36.120 --> 00:43.360]  So the multiple doubles go back actually from the time when people, when the hardware was
[00:43.360 --> 00:45.840]  not yet supporting doubles.
[00:45.840 --> 00:49.080]  So this was the late 60s.
[00:49.080 --> 00:51.720]  So this is actually a similar idea.
[00:51.720 --> 00:58.680]  So you use the hardware arithmetic to extend your precision.
[00:58.680 --> 01:00.600]  It has a lot of advantages.
[01:00.600 --> 01:07.320]  So if you're used to working with complex arithmetic, then double-double arithmetic has about the
[01:07.320 --> 01:10.400]  same intensity.
[01:10.400 --> 01:17.920]  So speaking of intensities, relative to the previous talk where we were working with graphs,
[01:17.920 --> 01:23.080]  so in the previous talk I had the impression that everything was about graphs and there
[01:23.080 --> 01:24.760]  was a memory bound.
[01:24.760 --> 01:27.440]  My problems are compute bound.
[01:27.440 --> 01:33.200]  So I get really good arithmetic intensities.
[01:33.200 --> 01:35.000]  There are some disadvantages, of course.
[01:35.000 --> 01:40.320]  If you want to work with, say, 17 decimal places, you can't.
[01:40.320 --> 01:46.040]  Also if you want to work with truly infinite decimals, well, you can't either because
[01:46.040 --> 01:52.600]  you're still having your 11 bits of the exponent.
[01:52.600 --> 01:57.240]  Disadvantage might also be that you can still do numerical analysis.
[01:57.240 --> 02:02.440]  So this might be an advantage or disadvantage.
[02:02.440 --> 02:05.960]  I got into this by power series arithmetic.
[02:05.960 --> 02:09.760]  So this is about the EXP and the EPS.
[02:09.760 --> 02:15.240]  So when I started working with power series, I was using 11111111.
[02:15.240 --> 02:17.120]  And I know the binomial theorem.
[02:17.120 --> 02:22.080]  Well, I only knew it when I saw the numbers blowing up on me.
[02:22.080 --> 02:24.800]  So you know it when you don't know it.
[02:24.800 --> 02:26.480]  So here is a table.
[02:26.480 --> 02:31.880]  The exponential has a very nice development, nicely decaying.
[02:31.880 --> 02:36.120]  And if you multiply these exponentials, you don't have any blow-up.
[02:36.120 --> 02:43.320]  However, the last coefficient, if you want to represent that, and you have to think about
[02:43.320 --> 02:44.320]  GPUs.
[02:44.320 --> 02:48.640]  GPUs are actually quite happy if they can do things in groups of 32.
[02:48.640 --> 02:56.440]  So actually a 32 power series, an order 32 power series, is actually still very small
[02:56.440 --> 02:57.440]  for GPUs.
[02:57.440 --> 03:00.040]  But there you have already to use quad doubles.
[03:00.040 --> 03:07.920]  Otherwise, your last coefficients, you can't represent it anymore.
[03:07.920 --> 03:09.920]  OK.
[03:09.920 --> 03:13.920]  So I started working with the QD library.
[03:13.920 --> 03:18.000]  And then we were doing multi-core.
[03:18.000 --> 03:24.680]  Me and my student, Gennady Jofi, and we looked at each other, should we do this on the GPU?
[03:24.680 --> 03:28.000]  Should we write the entire library on the GPU?
[03:28.000 --> 03:31.160]  My student didn't really want to do it, and I didn't want to do it.
[03:31.160 --> 03:35.680]  But then we discovered GQD, and we used GQD.
[03:35.680 --> 03:39.800]  And the recent package that we are using is Compari.
[03:39.800 --> 03:43.880]  It's actually the only software I know that is named after a beverage.
[03:43.880 --> 03:46.560]  I don't know if that's a good sign or not.
[03:46.560 --> 03:52.040]  In my supermarket store in Chicago, I once saw Compari, but it's not my drink.
[03:52.040 --> 03:58.040]  So I didn't want to ruin the taste of using Compari.
[03:58.040 --> 03:59.440]  So I stayed off this.
[03:59.440 --> 04:01.080]  Compari is actually quite good.
[04:01.080 --> 04:08.920]  So because it allowed me to go to quad double, and now also octo double.
[04:08.920 --> 04:15.840]  The numbers in this table are kind of good, because I want to have really performance.
[04:15.840 --> 04:22.040]  But it also comes somehow misleading, because as soon as you're using complex double-double,
[04:22.040 --> 04:24.320]  everything becomes compute bound.
[04:24.320 --> 04:30.320]  And the problems that you have with memory transfer and all, you do a lot of arithmetic
[04:30.320 --> 04:37.760]  operations on a relatively small amount of data.
[04:37.760 --> 04:40.360]  I also like to do quality up.
[04:40.360 --> 04:48.320]  If you can afford the time for, say, a double precision calculation, well, you will see
[04:48.320 --> 04:50.880]  that everything is not really right.
[04:50.880 --> 04:59.760]  But then you can allow the same amount of time, and you quadruple the precision.
[04:59.760 --> 05:09.600]  So the 439 there, think about 1 gigaflop, 2 gigaflop, and then you go to teraflop.
[05:09.600 --> 05:15.040]  So the 439 is kind of, if you have teraflop performance, it's like as if you would be
[05:15.040 --> 05:19.140]  doing this on a single core.
[05:19.140 --> 05:23.000]  So I mentioned the funding agencies at the very slight.
[05:23.000 --> 05:26.080]  I would like to have a hopper.
[05:26.080 --> 05:31.000]  But so for now, I have to deal with Pascal and Volta.
[05:31.000 --> 05:38.720]  And the last one is a gaming laptop, which is also actually quite a powerful GPU.
[05:38.720 --> 05:42.040]  My first teraflop card was Kepler.
[05:42.040 --> 05:46.960]  And this last list of GPU actually gets there.
[05:46.960 --> 05:49.960]  Okay.
[05:49.960 --> 05:54.480]  If you think of a double-double, there is a double-two.
[05:54.480 --> 05:57.480]  And then for a quad-double, there is the double-four.
[05:57.480 --> 06:00.520]  So that was what the GQD was using.
[06:00.520 --> 06:03.160]  And that's very good for memory coalescing.
[06:03.160 --> 06:09.040]  But we actually got into trouble with the complex quad-double because there was no double-eight.
[06:09.040 --> 06:15.680]  So instead of working, if you work with a vector of quad-doubles, a vector of arrays
[06:15.680 --> 06:20.360]  of four length, you actually better use four vectors.
[06:20.360 --> 06:25.800]  The first one with the highest double, second double, third double, fourth double.
[06:25.800 --> 06:29.160]  So it's a little bit similar like working with power series.
[06:29.160 --> 06:33.280]  So power series is invertible if the leading coefficient is not zero.
[06:33.280 --> 06:36.200]  You can work with matrices of power series.
[06:36.200 --> 06:37.280]  But actually, that's not good.
[06:37.280 --> 06:41.640]  You should actually work with a series where the coefficients are matrices.
[06:41.640 --> 06:43.360]  Same idea here.
[06:43.360 --> 06:47.760]  QDLIP is a very good library still.
[06:47.760 --> 06:48.940]  It's quite complete.
[06:48.940 --> 06:55.200]  So I have extended the square root, for example, to octodouble precision.
[06:55.200 --> 06:56.480]  OK.
[06:56.480 --> 07:01.840]  So here is then my beginning.
[07:01.840 --> 07:04.880]  So I mentioned, so you saw this eight.
[07:04.880 --> 07:12.400]  So if you take a vector of random complex numbers, 64, then the norm is eight.
[07:12.400 --> 07:13.400]  Should be eight.
[07:13.400 --> 07:16.480]  So that's a really nice test property.
[07:16.480 --> 07:22.200]  If you work with GPUs, you actually define kernels, and kernels, the name says it itself,
[07:22.200 --> 07:23.200]  it should be small.
[07:23.200 --> 07:24.800]  So think small.
[07:24.800 --> 07:29.120]  And actually, this problem is a small problem, mathematically speaking, but it has all the
[07:29.120 --> 07:35.480]  richness and the complexity of all the problems that you will run into.
[07:35.480 --> 07:39.200]  You will have to study the prefix sum algorithm, for example.
[07:39.200 --> 07:40.560]  So that is needed.
[07:40.560 --> 07:46.600]  You also have to tune your software for large vectors or for small vectors.
[07:46.600 --> 07:53.440]  You can only have one block of threads that is active.
[07:53.440 --> 07:56.680]  The square root works with staggered.
[07:56.680 --> 08:01.880]  So you apply a Newton method.
[08:01.880 --> 08:07.320]  And then actually, this is where the dot comes in.
[08:07.320 --> 08:15.600]  So the nice thing about double doubles, quad doubles, is that everything fits into registers.
[08:15.600 --> 08:19.120]  So it's also very good if you do multi-core.
[08:19.120 --> 08:21.120]  So you don't have to use the heap ever.
[08:21.120 --> 08:25.560]  But of course, when you get to complex quad doubles, you have these eight arrays.
[08:25.560 --> 08:29.160]  If you do octodoubles, so it doubles and doubles and doubles.
[08:29.160 --> 08:34.840]  So I have with my old graphics cards, they can no longer even compile the octodoubles
[08:34.840 --> 08:36.880]  if you inline too much.
[08:36.880 --> 08:45.960]  So it's still very interesting that, actually, you have to tailor your kernels towards the
[08:45.960 --> 08:48.760]  precision levels.
[08:48.760 --> 08:50.000]  So here is my last slide.
[08:50.000 --> 08:52.800]  I did more than just norms.
[08:52.800 --> 09:00.320]  So we have teraflop performance when we evaluate polynomials and differentiate them.
[09:00.320 --> 09:04.080]  The QR, the blocked householder QR, is also wonderful.
[09:04.080 --> 09:12.280]  You get already teraflop performance with 1,000 in complex double-double.
[09:12.280 --> 09:17.520]  And then the last paper is where you try to combine these things by computing Taylor series
[09:17.520 --> 09:24.720]  for solutions of solution developments for polynomial systems.
[09:24.720 --> 09:27.240]  Newton's method is actually a quite nice operator.
[09:27.240 --> 09:32.680]  You start with a multivariate system where all the variables are linked to each other.
[09:32.680 --> 09:38.880]  And what Newton actually does, it spits out power series for each component.
[09:38.880 --> 09:46.240]  So actually, it untangles all the linearities, all the nonlinearities.
[09:46.240 --> 09:47.640]  So I listed the archive.
[09:47.640 --> 09:52.160]  So the IEEE puts things in a paywall, behind the paywall.
[09:52.160 --> 09:55.240]  So you have the archive versions there.
[09:55.240 --> 09:59.000]  And you're more than welcome to the bottom line of this slide.
[09:59.000 --> 10:02.960]  I mean, the conclusion, actually, is that all the software is free and open source.
[10:02.960 --> 10:32.040]  I'd have the GitHub handle there.