Hello, everyone. Good afternoon. I am Thanos from the University of Manchester, and today I have the pleasure of presenting the current state of TornadoVM. In fact, I also want to focus on a slogan that is well known to everyone: write once, run anywhere, for Java. So I will start with that.

This slogan was coined back in the 1990s by Sun Microsystems to advertise that the Java language, and the JVM in particular, is a platform that ensures portability across different CPU architectures. The idea is that programmers can compile their code once and run it transparently on different hardware. However, hardware has changed in recent years. It keeps evolving, and perhaps this model is no longer sufficient for the new types of hardware resources that are emerging. Lately we have GPUs and FPGAs, which complement the power of CPUs in order to maximize performance and reduce energy consumption.

These devices are good, but they also bring challenges, mainly around programmability: how can programmers harness the power of these resources? I don't know if you have experience with OpenCL and CUDA, but the programming models designed to give access to this kind of hardware are mainly focused on the C and C++ world. There are different programming models from different companies, such as SYCL, oneAPI, NVIDIA CUDA, and OpenCL, which is a standard that can run on all devices. And if you have FPGA expertise, you can perhaps write RTL in Verilog, which is a hardware description language, but that is very low level. Here we are talking about Java, so we want to stay high level.

So, if you are a Java developer, you use the JVM and you target the CPU. If you want access to these devices, you need to write your own native interfaces with JNI and then tap into the C and C++ world. But you still need to be aware of how these programming models work; you need to be familiar with them. And this is exactly the problem that TornadoVM has been designed to solve.

TornadoVM is a plug-in to existing OpenJDK distributions, such as Amazon Corretto, Red Hat Mandrel, Azul Zulu, and others, and it is built to enable hardware acceleration in an easy manner. It offers a Java API, and inside it has a JIT compiler for the hardware devices shown in this figure. The compiler can automatically translate Java bytecode to run on multi-core CPUs, integrated or discrete GPUs, and FPGAs.
The compiler has three different backends. It can emit OpenCL C; PTX, which is the assembly language for NVIDIA GPUs under CUDA; and, recently, also SPIR-V, which enables us to use the Level Zero dispatcher from oneAPI.

So TornadoVM is a technology that can be used as a JVM plug-in to enable hardware acceleration for JVMs. Some of its key features: it has a lightweight Java API; code is written in a platform-agnostic manner, so the same program can run unchanged regardless of which device it will execute on; and at compile time it transparently specializes the code, because the code generated for a GPU is completely different from the code generated for an FPGA. So, regarding the compiler, we have different phases that are enabled to specialize the code for GPUs and different phases for FPGAs.

Our code is available on GitHub, and we encourage everyone who wants to have a look to fork it, download it, play with the examples, or even create their own, and also to come back to us. Feel free to use the discussions if you have questions, or to open issues if something is broken so we can fix it. We also provide Docker images for NVIDIA GPUs and Intel integrated GPUs.

Now, the next part I want to talk about is the API. Two weeks ago we released a new version of TornadoVM, version 0.15, and it comes with many changes at the API level. Our goal was to make the API easier for Java programmers to use in a comprehensible manner, so that they know how to use it and how to express parallelism from Java.

But first I have to make you familiar with the programming model of TornadoVM. It is inspired by the way heterogeneous programming models like OpenCL and CUDA operate. In this sense, a Java program is composed of two parts. The first part is the host code, which is the actual core of the Java application. The second part is the accelerated code, which is the method, or set of methods, that will be offloaded for execution on a GPU.

Once we have made this clear, we can move to the execution model. Because the processing takes place on a device, we first have to move the data from the host, the CPU, to the actual device; then perform the processing; and once the processing is finished, the result has to be transferred back to the host.
Now, in the TornadoVM API we have exposed a set of objects and annotations for each of these two parts, the host code and the accelerated code. For the host code we have the task-graph object and the Tornado execution plan. The task graph corresponds to what to run on the GPU, and the Tornado execution plan is how to run it on the GPU. For the accelerated code we have a set of annotations and objects that I will show you later.

So let's start with the task graph: what to run. Assume you are a Java programmer and you want to offload the execution of a method, in this example method A, to the GPU. This method has some input and some output. It corresponds to what we call, in TornadoVM terminology, a task. So each method that will be offloaded for hardware acceleration is a task, with its input data and output data. And then we have a group of tasks, which is the task graph. A task graph is a group of tasks that may or may not have dependencies between them, and that the programmer wants to offload all together for hardware acceleration. In this particular example I have put one task in the task graph.

Once we have defined what to run, the next question is how often to transfer the data between the host, the CPU, and the device. This can have a tremendous impact, because it affects the data transfer time and therefore the overall execution time; it affects performance, but it can also affect energy and power consumption. So how data is transferred matters, and it depends on the access pattern of the application. One application may need to copy the input only on the first execution, if the data are read-only; another may need to copy on every execution; or only on the last execution, for example for the output result.

Here is a code snippet showing how the task graph defines this functionality in the Tornado API. We create a new task-graph object and assign it a name, which is a string; in this particular example it is "tg". Then we use the methods exposed by the API to fulfill the execution model. First we call transferToDevice, which takes two arguments. The first argument is the data transfer mode, which controls how often the data will be moved; in this example it is set to the first execution, so the data will be moved only on the first execution. Then we pass the parameter, which is the input array. The second method is task, and it defines which method will be offloaded for hardware acceleration. Its first parameter is a name for the task, a string; it could be any name, and it is also used for dynamic configuration, which I will show you later. The second parameter is the method reference, the reference to the method that will be offloaded to the GPU for acceleration, followed by the list of arguments corresponding to the method signature. The last method is transferToHost, and again its first argument is the data transfer mode; in this example we copy the output back on every execution.
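To make this concrete, here is a minimal host-code sketch of such a task graph, assuming the 0.15 API as documented at the time; the graph name "tg" comes from the talk, while the task name "t0", the class Example, the method methodA, and the array sizes are placeholders of mine.

```java
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class Example {

    // Accelerated code: the method to be offloaded (the body is a placeholder).
    public static void methodA(float[] input, float[] output) {
        for (int i = 0; i < input.length; i++) {
            output[i] = input[i] * 2.0f;
        }
    }

    public static void main(String[] args) {
        float[] input = new float[1024];
        float[] output = new float[1024];

        // Host code: what to run. The input is copied only on the first
        // execution; the result is copied back after every execution.
        TaskGraph taskGraph = new TaskGraph("tg")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, input)
                .task("t0", Example::methodA, input, output)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, output);

        // ... snapshot and execution plan follow (see the next sketch).
    }
}
```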
Once we have defined the task through the task graph, the task graph can be updated: we can append a new task, a second one, or change how the data transfers are triggered, for example from every execution to only the first execution. The next step is to define the immutable state of the task graph, that is, to preserve its shape. This is done by taking a snapshot of the task graph, using the snapshot method of the task-graph object, which returns an immutable task graph. This is what can be used for JIT compilation and execution on the hardware. It ensures that Java programmers can create different versions of their task graph and keep updating it, while the code cache we have in TornadoVM stores all these versions, so it does not need to recompile and overwrite the generated code. And this is the final step before we move to the actual execution plan: we have the mutable task graph, which can still be modified, and the immutable task graph, which cannot be modified anymore. If users want to make a change, they can still change the task graph and take a new snapshot as a second version of their code.

Now we move to how to run, how to execute the task graph, and this is done through the execution plan. Here is a snippet of the Tornado execution plan. We create a new object that accepts as input the immutable task graph, which does not change anymore. Then we can either execute it directly in the default execution mode by invoking the execution plan's execute method, or we can configure it with various optimizations. In this particular example I have enabled dynamic reconfiguration, which is a feature of TornadoVM that launches a Java thread per device available on the system to JIT compile and execute the application. So if we have a CPU, a GPU, and an FPGA, a Java thread runs for each of them, and because it is triggered with the performance policy, the first device that finishes the execution is considered the best and the remaining Java threads are killed.
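Continuing the earlier sketch, the snapshot and execution-plan part could look roughly like this; the withDynamicReconfiguration call, its Policy and DRMode arguments, and their import paths are written from memory of the 0.15 release, so treat them as assumptions and check the documentation.

```java
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
// Assumed locations of the dynamic-reconfiguration types:
import uk.ac.manchester.tornado.api.Policy;
import uk.ac.manchester.tornado.api.enums.DRMode;

// How to run: freeze the task graph and execute it through an execution plan.
ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph);

// Default execution mode:
executionPlan.execute();

// Or configure it first, e.g. dynamic reconfiguration with the performance
// policy: the task graph is compiled and run per available device, and the
// fastest device wins.
executionPlan.withDynamicReconfiguration(Policy.PERFORMANCE, DRMode.PARALLEL)
             .execute();
```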
Now I have concluded the host-code part, so we can briefly go to the accelerated code, which is where we express parallelism within the kernel, within the method, or the TornadoVM task as we call it. We have two ways, two APIs. The first one is called the loop-parallel API: we expose the @Parallel annotation, which programmers can use as a hint to the TornadoVM JIT compiler that a loop can run in parallel. The second one is the kernel API, which is exposed to users through the KernelContext object. This API is meant for OpenCL and CUDA programmers, or Java programmers who know OpenCL, and it gives more freedom in how to code things: for example, it gives access to local memory, which for GPUs is roughly the equivalent of the CPU's cache memory. So they have more freedom in what they can express, and in fact I have used this API to port existing kernels written in OpenCL and CUDA to Java. For more information you can follow this link, which points to the TornadoVM documentation and describes some examples.

I will briefly go through one example, a matrix multiplication, which I also presented last year at FOSDEM. In this example we have the accelerated code and the host code. The matrix multiplication method implements matrix multiplication over flattened arrays in two dimensions. To annotate and express parallelism with the loop-parallel API, we add the @Parallel annotation to the for loops, indicating that these loops can be executed in parallel. With the second API, the kernel API, we would use the KernelContext object, and in particular the global thread identifiers in x and y, which correspond to the two dimensions we have; in a sense, it is like having the ID of the thread that will execute on the GPU.
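As a minimal sketch of the two styles, here is the matrix-multiplication kernel written both ways, assuming square matrices stored as flattened one-dimensional arrays of length n*n; the class and method names are mine, and the KernelContext field names (globalIdx, globalIdy) are as I recall them from the documentation.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class MatrixKernels {

    // Loop-parallel API: @Parallel hints to the TornadoVM JIT compiler that
    // the annotated loops can be mapped to parallel threads on the device.
    public static void mxmLoopParallel(float[] a, float[] b, float[] c, int n) {
        for (@Parallel int i = 0; i < n; i++) {
            for (@Parallel int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    // Kernel API: each thread computes one output element, identified by its
    // global thread indices in the x and y dimensions.
    public static void mxmKernelContext(KernelContext context, float[] a, float[] b, float[] c, int n) {
        int i = context.globalIdx;
        int j = context.globalIdy;
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += a[i * n + k] * b[k * n + j];
        }
        c[i * n + j] = sum;
    }
}
```

Note that with the kernel API the thread dimensions are not derived from loops, so the host code additionally has to define them; in TornadoVM this is done with a worker grid and a grid scheduler attached to the execution plan, which I omit here.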
Here are some of the use cases where we have applied TornadoVM. Concluding this talk, I would like to focus on a feature that we implemented in a research project we are working on, called ELEGANT. The idea is to create a software stack that unifies development for big data and IoT deployments, and there TornadoVM is used as a technology to enable acceleration as a service. We have implemented a REST API; it is still a prototype, but programmers can submit a method and specify the characteristics of the targeted device, and the service returns OpenCL code that is meant to run that function in parallel. The interesting part is that the generated OpenCL code is portable across different programming languages, so it is not bound only to Java: it can also be run from C++ or Python, because it is OpenCL. In this particular example we have Java: we use OpenJDK, we take the bytecode and pass it to TornadoVM, and TornadoVM runs an experimental feature called code interoperability mode, which converts the bytecode to OpenCL that can be run from any programming language and runtime. Therefore, it is like prototyping parallel programming in Java.

Wrapping up, we would like to receive feedback, and we are also looking for collaborations, whether that means helping to port use cases or anything else. Summarizing this talk, I briefly went through write once, run anywhere in the context of heterogeneous hardware acceleration. I introduced TornadoVM, which is an open-source project with its code base available on GitHub, and I familiarized you with the programming model of TornadoVM and the new API and how to use it. More is about to come on the Foojay blog with a new post. Finally, I just want to acknowledge the projects that have supported our research at the University of Manchester. And I am ready for questions.