Hello, everyone. Good afternoon. I am Thanos from the University of Manchester, and today I have the pleasure of presenting the current state of TornadoVM. In fact, I also want to focus on a slogan that is well known to everyone: write once, run anywhere, for Java. So I will start with that.

This slogan was coined back in the 1990s by Sun Microsystems to advertise that the Java language, and the JVM in particular, is a platform that ensures portability across different CPU architectures. The idea is that programmers can compile their code once and run it transparently on different hardware. However, hardware has changed in recent years. It keeps evolving, and perhaps this model is no longer sufficient for the new types of hardware resources that are emerging. Lately we have GPUs and FPGAs, which complement the power of CPUs in order to maximize performance and reduce energy consumption.

These devices are good, but they also bring challenges, mainly around programmability: how can programmers harness the power of these resources? I don't know if you have experience with OpenCL and CUDA, but the programming models designed to give access to this kind of hardware are mainly focused on the C and C++ world. There are different programming models from different companies, such as SYCL, oneAPI, NVIDIA CUDA, and OpenCL, which is a standard that can run on all devices. And if you have FPGA expertise, you can perhaps write RTL in Verilog, which is a hardware description language, but that is very low level. Here we are talking about Java, so we want to stay high level.

So, if you are a Java developer, you use the JVM and you target the CPU. If you want access to these devices, you need to write your own native interfaces with JNI and then tap into the C and C++ world. But you still need to be aware of how these programming models work; you need to be familiar with them. And this is exactly the problem that TornadoVM has been designed to solve.

TornadoVM is a plug-in to existing OpenJDK distributions, such as Amazon Corretto, Red Hat Mandrel, Azul Zulu, and others, and it is built to enable hardware acceleration in an easy manner. It offers a Java API, and inside it has a JIT compiler for the hardware devices shown in this figure. The compiler can automatically translate Java bytecode to run on multi-core CPUs, integrated or discrete GPUs, and FPGAs.
The compiler has three different backends. It can emit OpenCL C; PTX, which is the assembly language for NVIDIA GPUs under CUDA; and, recently, also SPIR-V, which enables us to use the Level Zero dispatcher from oneAPI.

So TornadoVM is a technology that can be used as a JVM plug-in to enable hardware acceleration for JVMs. Some of its key features: it has a lightweight Java API; code is written in a platform-agnostic manner, so the same program can run unchanged regardless of which device it will execute on; and at compile time it transparently specializes the code, because the code generated for a GPU is completely different from the code generated for an FPGA. So, regarding the compiler, we have different phases that are enabled to specialize the code for GPUs and different phases for FPGAs.

Our code is available on GitHub, and we encourage everyone who wants to have a look to fork it, download it, play with the examples, or even create their own, and also to come back to us. Feel free to use the discussions if you have questions, or to open issues if something is broken so we can fix it. We also provide Docker images for NVIDIA GPUs and Intel integrated GPUs.

Now, the next part I want to talk about is the API. Two weeks ago we released a new version of TornadoVM, version 0.15, and it comes with many changes at the API level. Our goal was to make the API easier for Java programmers to use in a comprehensible manner, so that they know how to use it and how to express parallelism from Java.

But first I have to make you familiar with the programming model of TornadoVM. It is inspired by the way heterogeneous programming models like OpenCL and CUDA operate. In this sense, a Java program is composed of two parts. The first part is the host code, which is the actual core of the Java application. The second part is the accelerated code, which is the method, or set of methods, that will be offloaded for execution on a GPU.

Once we have made this clear, we can move to the execution model. Because the processing takes place on a device, we first have to move the data from the host, the CPU, to the actual device; then perform the processing; and once the processing is finished, the result has to be transferred back to the host.
Now, in the TornadoVM API we have exposed a set of objects and annotations for each of these two parts, the host code and the accelerated code. For the host code we have the task-graph object and the Tornado execution plan. The task graph corresponds to what to run on the GPU, and the Tornado execution plan is how to run it on the GPU. For the accelerated code we have a set of annotations and objects that I will show you later.

So let's start with the task graph: what to run. Assume you are a Java programmer and you want to offload the execution of a method, in this example method A, to the GPU. This method has some input and some output. It corresponds to what we call, in TornadoVM terminology, a task. So each method that will be offloaded for hardware acceleration is a task, with its input data and output data. And then we have a group of tasks, which is the task graph. A task graph is a group of tasks that may or may not have dependencies between them, and that the programmer wants to offload all together for hardware acceleration. In this particular example I have put one task in the task graph.

Once we have defined what to run, the next question is how often to transfer the data between the host, the CPU, and the device. This can have a tremendous impact, because it affects the data transfer time and therefore the overall execution time; it affects performance, but it can also affect energy and power consumption. So how data is transferred matters, and it depends on the access pattern of the application. One application may need to copy the input only on the first execution, if the data are read-only; another may need to copy on every execution; or only on the last execution, for example for the output result.

Here is a code snippet showing how the task graph defines this functionality in the Tornado API. We create a new task-graph object and assign it a name, which is a string; in this particular example it is "tg". Then we use the methods exposed by the API to fulfill the execution model. First we call transferToDevice, which takes two arguments. The first argument is the data transfer mode, which controls how often the data will be moved; in this example it is set to the first execution, so the data will be moved only on the first execution. Then we pass the parameter, which is the input array. The second method is task, and it defines which method will be offloaded for hardware acceleration. Its first parameter is a name for the task, a string; it could be any name, and it is also used for dynamic configuration, which I will show you later. The second parameter is the method reference, the reference to the method that will be offloaded to the GPU for acceleration, followed by the list of arguments corresponding to the method signature. The last method is transferToHost, and again its first argument is the data transfer mode; in this example we copy the output back on every execution.
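To make this concrete, here is a minimal host-code sketch of such a task graph, assuming the 0.15 API as documented at the time; the graph name "tg" comes from the talk, while the task name "t0", the class Example, the method methodA, and the array sizes are placeholders of mine.

```java
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class Example {

    // Accelerated code: the method to be offloaded (the body is a placeholder).
    public static void methodA(float[] input, float[] output) {
        for (int i = 0; i < input.length; i++) {
            output[i] = input[i] * 2.0f;
        }
    }

    public static void main(String[] args) {
        float[] input = new float[1024];
        float[] output = new float[1024];

        // Host code: what to run. The input is copied only on the first
        // execution; the result is copied back after every execution.
        TaskGraph taskGraph = new TaskGraph("tg")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, input)
                .task("t0", Example::methodA, input, output)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, output);

        // ... snapshot and execution plan follow (see the next sketch).
    }
}
```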
Once we have defined the task through the task graph, the task graph can be updated: we can append a new task, a second one, or change how the data transfers are triggered, for example from every execution to only the first execution. The next step is to define the immutable state of the task graph, that is, to preserve its shape. This is done by taking a snapshot of the task graph, using the snapshot method of the task-graph object, which returns an immutable task graph. This is what can be used for JIT compilation and execution on the hardware. It ensures that Java programmers can create different versions of their task graph and keep updating it, while the code cache we have in TornadoVM stores all these versions, so it does not need to recompile and overwrite the generated code. And this is the final step before we move to the actual execution plan: we have the mutable task graph, which can still be modified, and the immutable task graph, which cannot be modified anymore. If users want to make a change, they can still change the task graph and take a new snapshot as a second version of their code.

Now we move to how to run, how to execute the task graph, and this is done through the execution plan. Here is a snippet of the Tornado execution plan. We create a new object that accepts as input the immutable task graph, which does not change anymore. Then we can either execute it directly in the default execution mode by invoking the execution plan's execute method, or we can configure it with various optimizations. In this particular example I have enabled dynamic reconfiguration, which is a feature of TornadoVM that launches a Java thread per device available on the system to JIT compile and execute the application. So if we have a CPU, a GPU, and an FPGA, a Java thread runs for each of them, and because it is triggered with the performance policy, the first device that finishes the execution is considered the best and the remaining Java threads are killed.
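Continuing the earlier sketch, the snapshot and execution-plan part could look roughly like this; the withDynamicReconfiguration call, its Policy and DRMode arguments, and their import paths are written from memory of the 0.15 release, so treat them as assumptions and check the documentation.

```java
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
// Assumed locations of the dynamic-reconfiguration types:
import uk.ac.manchester.tornado.api.Policy;
import uk.ac.manchester.tornado.api.enums.DRMode;

// How to run: freeze the task graph and execute it through an execution plan.
ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph);

// Default execution mode:
executionPlan.execute();

// Or configure it first, e.g. dynamic reconfiguration with the performance
// policy: the task graph is compiled and run per available device, and the
// fastest device wins.
executionPlan.withDynamicReconfiguration(Policy.PERFORMANCE, DRMode.PARALLEL)
             .execute();
```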
Now I have concluded the host-code part, so we can briefly go to the accelerated code, which is where we express parallelism within the kernel, within the method, or the TornadoVM task as we call it. We have two ways, two APIs. The first one is called the loop-parallel API: we expose the @Parallel annotation, which programmers can use as a hint to the TornadoVM JIT compiler that a loop can run in parallel. The second one is the kernel API, which is exposed to users through the KernelContext object. This API is meant for OpenCL and CUDA programmers, or Java programmers who know OpenCL, and it gives more freedom in how to code things: for example, it gives access to local memory, which for GPUs is roughly the equivalent of the CPU's cache memory. So they have more freedom in what they can express, and in fact I have used this API to port existing kernels written in OpenCL and CUDA to Java. For more information you can follow this link, which points to the TornadoVM documentation and describes some examples.

I will briefly go through one example, a matrix multiplication, which I also presented last year at FOSDEM. In this example we have the accelerated code and the host code. The matrix multiplication method implements matrix multiplication over flattened arrays in two dimensions. To annotate and express parallelism with the loop-parallel API, we add the @Parallel annotation to the for loops, indicating that these loops can be executed in parallel. With the second API, the kernel API, we would use the KernelContext object, and in particular the global thread identifiers in x and y, which correspond to the two dimensions we have; in a sense, it is like having the ID of the thread that will execute on the GPU.
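As a minimal sketch of the two styles, here is the matrix-multiplication kernel written both ways, assuming square matrices stored as flattened one-dimensional arrays of length n*n; the class and method names are mine, and the KernelContext field names (globalIdx, globalIdy) are as I recall them from the documentation.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class MatrixKernels {

    // Loop-parallel API: @Parallel hints to the TornadoVM JIT compiler that
    // the annotated loops can be mapped to parallel threads on the device.
    public static void mxmLoopParallel(float[] a, float[] b, float[] c, int n) {
        for (@Parallel int i = 0; i < n; i++) {
            for (@Parallel int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    // Kernel API: each thread computes one output element, identified by its
    // global thread indices in the x and y dimensions.
    public static void mxmKernelContext(KernelContext context, float[] a, float[] b, float[] c, int n) {
        int i = context.globalIdx;
        int j = context.globalIdy;
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += a[i * n + k] * b[k * n + j];
        }
        c[i * n + j] = sum;
    }
}
```

Note that with the kernel API the thread dimensions are not derived from loops, so the host code additionally has to define them; in TornadoVM this is done with a worker grid and a grid scheduler attached to the execution plan, which I omit here.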
Here are some of the use cases where we have applied TornadoVM. Concluding this talk, I would like to focus on a feature that we implemented in a research project we are working on, called ELEGANT. The idea is to create a software stack that unifies development for big data and IoT deployments, and there TornadoVM is used as a technology to enable acceleration as a service. We have implemented a REST API; it is still a prototype, but programmers can submit a method and specify the characteristics of the targeted device, and the service returns OpenCL code that is meant to run that function in parallel. The interesting part is that the generated OpenCL code is portable across different programming languages, so it is not bound only to Java: it can also be run from C++ or Python, because it is OpenCL. In this particular example we have Java: we use OpenJDK, we take the bytecode and pass it to TornadoVM, and TornadoVM runs an experimental feature called code interoperability mode, which converts the bytecode to OpenCL that can be run from any programming language and runtime. Therefore, it is like prototyping parallel programming in Java.

Wrapping up, we would like to receive feedback, and we are also looking for collaborations, whether that means helping to port use cases or anything else. Summarizing this talk, I briefly went through write once, run anywhere in the context of heterogeneous hardware acceleration. I introduced TornadoVM, which is an open-source project with its code base available on GitHub, and I familiarized you with the programming model of TornadoVM and the new API and how to use it. More is about to come on the Foojay blog with a new post. Finally, I just want to acknowledge the projects that have supported our research at the University of Manchester. And I am ready for questions.