Thank you for the opportunity to present our project, numba-mpi. Let me first acknowledge the co-authors. My name is Sylwester Arabas, and I am here with Oleksii Bulenok and Kacper Derlatka from the Jagiellonian University in Kraków, Poland. Maciej Manna from the same university contributed to this project, and we will also be presenting some work by David Zwicker from the Max Planck Institute for Dynamics and Self-Organisation in Göttingen.

So let's start with a maybe controversial, provocative question: Python and HPC. And let's try to look for answers to this question in a very respected journal. Maybe you have some guesses what's written there. 2019: in scripting languages such as Python, users type code into an interactive editor line by line. It doesn't sound like HPC. The next year: a level of computational performance that Python simply couldn't deliver. Same year, same journal: Numba runs on machines ranging from embedded devices to the world's largest supercomputers, with performance approaching that of compiled languages. Same year, Nature Astronomy: astronomers should avoid interpreted scripting languages such as Python; in principle, Numba and NumPy can lead to an enormous increase in speed, but please reconsider teaching Python to university students. Same year, Nature Methods: for implementing new functionality into SciPy, Python is still the language of choice, and the full test suite should pass with the PyPy just-in-time compiler as of SciPy 1.0.

Are they talking about the same language? No. The left-hand side are papers about Rust and Julia; the right-hand side are papers about Python. So maybe that's the reason.

Just to set the stage, let me present what I think is an apt way of thinking about Python. Python as a language lacks any support for multi-dimensional arrays or number crunching, because it leaves that to packages. Python also leaves it to implementations to actually interpret its syntax. CPython is, of course, the main implementation, but it is not the only one, and solutions exist that streamline, for example, just-in-time compilation of Python code. Moreover, NumPy, while the de facto standard, is not the only implementation of the NumPy API: alternatives are embedded in just-in-time compilation frameworks and GPU frameworks for Python, and they leverage typing and concurrency.
So probably the highlight here is that Python lets you glue these technologies together and package them together, leveraging the Python ecosystem, its popularity, et cetera. And arguably, I would say, that's an advantage. I'm not saying please use Python for HPC instead of Julia — probably vice versa, actually — but it is still an interesting question to see how well it can perform.

OK, so let's check it. I will present a brief benchmark, a very tiny one, that we came up with in relation to this project, and it uses Numba. Numba is a just-in-time compiler that translates a subset of Python and NumPy into machine code, compiled at runtime using LLVM.

So here is the story behind the super-simple benchmark problem. It is related to numerical weather prediction. You can imagine a grid of numbers representing the weather. The integration part of numerical weather prediction involves solving equations for the hydrodynamics — that is, the transport of such a pattern in space and time — and, of course, for the thermodynamics that tell you what is happening in the atmosphere. This is a super-simplified picture — I'm not saying that's the whole story about NWP — but for benchmarking Numba, let's simplify it down to a simple two-dimensional problem: a grid in x and y, and some signal. If we look at just the transport part — a partial differential equation for advective transport — we can see what happens when we move such a signal, which could be humidity, temperature, whatever, around in the atmosphere.
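To give a flavour of what Numba does here, a minimal sketch — not the PyMPDATA code, just an illustrative loop-based one-dimensional upwind advection step — could look as follows:

```python
# Illustrative only (not the PyMPDATA code): a loop-based 1D upwind
# advection step, JIT-compiled to machine code via LLVM by Numba.
import numba
import numpy as np

@numba.njit(parallel=True)
def upwind_step(psi, courant):
    out = np.empty_like(psi)
    for i in numba.prange(psi.size):        # multi-threaded loop
        left = psi[(i - 1) % psi.size]      # periodic boundary condition
        out[i] = psi[i] - courant * (psi[i] - left)
    return out

psi = np.zeros(1000)
psi[450:550] = 1.0                          # a square signal to be advected
for _ in range(100):
    psi = upwind_step(psi, 0.5)             # first call triggers compilation
```

The first call compiles the function for the given argument types; subsequent calls run the compiled machine code.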
So we have a sample problem. Here I'm showing results from a three-dimensional version of what was just shown. Let's start with the right-hand-side plot: on the x-axis, the size of the grid — if it's 8, it means 8 by 8 by 8, super tiny; if it's 128, it's 128 by 128 by 128 — and wall time per time step on the y-axis. Green: a C++ implementation of one particular algorithm for this kind of problem. Orange: PyMPDATA, numerically the same algorithm, but a Python implementation. Here you can see that the Numba JIT-compiled version actually outperformed C++, maintaining even better scaling for the tiny grids, although those are rather irrelevant for the problem. And please note that in both cases we used multi-threading. On the left-hand side you can see the number of threads on the x-axis and wall time per time step on the y-axis. Again, the green line is the C++ implementation, and these two are the two variants of the Python one, JIT-compiled with Numba — almost an order of magnitude, well, five times faster execution. What is probably most interesting for now is that when you compare with just setting the environment variable that disables Numba's JIT (NUMBA_DISABLE_JIT), we jump more than two orders of magnitude up in wall time. So this is how the Numba timing compares with plain-Python timing.

But there are two important caveats to mention here. First, the Python package was written with Numba in mind — everything is loop-based — which is the reason why plain CPython with NumPy performs badly; that line is essentially irrelevant, shown just as a curiosity. Second, the C++ version is somewhat legacy: it is based on the Blitz++ library (back when it was developed, Eigen didn't have support for multiple dimensions), and its object-oriented array processing was reported and measured to be roughly five times slower than Fortran 77 for these kinds of small domains — it's not the same for larger domains.

Anyhow, we can achieve high performance with Python. But what if we need MPI — message passing — in our code? How would we use it? Let's say we do domain decomposition and split our domain into two parts: the same problem, same setup, just half of the domain computed by one process, or node, or anything that has a different memory address space than the other worker.

Why do we want to use MPI? Well, because, despite the expansion in parallel computation, both in the number of machines and the number of cores, no other parallel programming paradigm has replaced MPI — at least as of 2013. And already in 2013 people were writing this even though, as they put it, it is universally acknowledged that MPI is a rather crude way of programming these machines. Anyhow, let's try it, and let's try it with Python.

So here is a seven-line snippet of code where we import Numba to get JIT compilation of Python code, and we use mpi4py, the Python interface to MPI. What do we do? We define some number-crunching routine and try to use MPI from it, and then we try to njit it — @njit being the highest-performance variant of Numba's JIT compilation. We JIT-compile this function and straight away execute it. What happens? It doesn't work. It cannot compile, because Numba cannot determine the type of mpi4py.MPI.Intracomm — it's a class, and classes do not work with Numba, at least not ordinary Python classes. So something doesn't work.
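A minimal reconstruction of that kind of failing snippet — with illustrative names, not the exact code from the slide — looks like this:

```python
# Illustrative reconstruction of the failing pattern: mpi4py used
# inside an @njit-compiled function (names are not from the slide).
import numba
from mpi4py import MPI

@numba.njit()
def number_crunching(comm):
    return comm.Get_rank()   # any mpi4py call inside JIT-compiled code

number_crunching(MPI.COMM_WORLD)
# Numba raises a TypingError: it cannot infer a type for the
# mpi4py.MPI.Intracomm argument, which is an ordinary Python class,
# so the function cannot be compiled in nopython mode.
```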
So the problem is that we have Numba, which is one of the leading solutions for speeding up Python, and MPI, which is clearly the de facto standard for distributed-memory parallelization, and when we try to make them work together, it doesn't work.

So: Stack Overflow — nothing. Let's Google it — nothing. Let's Qwant it — nothing. Wrong search engine, right? Someone must have solved this problem. Nothing. Let's ask the Numba and mpi4py developers. In 2020: you will not be able to use mpi4py's Cython code; it was not designed for such low-level usage. Well, okay, it's Cython. But I mean, it must be doable, right? We have two established packages, and the aim is solid and makes sense. So it must be doable.

And 30 months, 120 comments and 50 PRs later, with five contributors on a totally unplanned side project, we are introducing numba-mpi. numba-mpi is an open-source, rather small Python project which allows you — let's jump here to the hello-world example — to use the Numba @njit decorator on a function that calls rank, size, or any of the other wrapped MPI routines from within the Python code. As of now we cover size, rank, send, recv, allreduce, bcast and barrier. The numba-mpi API is based on NumPy arrays. We have auto-generated documentation, and we are on PyPI and conda-forge.

A few words about how it is implemented. Essentially, we use ctypes, which is built into Python, to address the MPI C API. There are some things related to passing addresses, memory, void pointers, et cetera — super interesting. Probably the key message here is that the send function we offer is itself njit-ed, which means you can call it from other njit-ed functions. We handle non-contiguous NumPy arrays, so we try to be user-friendly. We then call the underlying C function, and that's essentially all. But there really is a key line, number 30 — this one. It looks like nothing, but in principle, without it, Numba optimizes out all of our code.
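For illustration, a minimal hello-world along the lines just described — assuming the size, rank, send and recv wrappers with the buffer-based argument order from the numba-mpi documentation — could look like this (run with, e.g., mpiexec -n 2 python hello.py):

```python
# Sketch of the hello-world usage: numba-mpi calls from @njit-compiled
# code; the send/recv argument order (data, peer, tag) is assumed from
# the numba-mpi documentation.
import numba
import numpy as np
import numba_mpi as mpi

@numba.njit()
def hello():
    buf = np.empty(1, dtype=np.float64)
    if mpi.rank() == 0:
        buf[0] = 42.0
        mpi.send(buf, 1, 0)   # send the NumPy buffer to rank 1, tag 0
    elif mpi.rank() == 1:
        mpi.recv(buf, 0, 0)   # receive into a buffer of the same shape
    return mpi.rank(), mpi.size(), buf[0]

print(hello())
```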
Anyhow, hacks like that key line are the kind of thing you run into when implementing such a package, and unfortunately there are quite a few more of them inside numba-mpi. The next slide is the kind of thing you would prefer never to see, but it cannot be unseen once you work with it. So please just think of it as a picture of some of the problems we have faced: we essentially wrote to the Numba developers asking how it could be done, and we got these kinds of tools for handling void pointers from ctypes within Numba, with Python, NumPy, et cetera. Well, that's utils.py, and that's it — and it works. How do we know it works? Because we test it. Let me hand the microphone over to Oleksii to tell you more about testing.

Okay, it's focused. So I'm going to tell you about the CI that we have set up for our project, for numba-mpi. The CI is set up on GitHub Actions, and this is a screenshot of the workflow. We start by running pdoc, pre-commit and pylint: pdoc for automatic documentation generation, pylint for static code analysis, and pre-commit for styling. If these steps succeed, we move on to the main part, where we run our unit tests — this is the actual workflow file that we use. As you can see, we run the CI against multiple operating systems, different Python versions and different MPI implementations. Here we should say a big thank-you to the mpi4py team for providing the setup-mpi GitHub Action, because it has saved us a lot of time — so thank you, mpi4py. As for operating systems and MPI implementations: on Linux we test against OpenMPI, MPICH and Intel MPI; on macOS, OpenMPI and MPICH; and on Windows, of course, the MS-MPI implementation.

Speaking of MPICH, there is a problem that has occurred recently: starting from MPICH version 4, it fails on Ubuntu in our CI for Python versions below 3.10. If anyone has ideas how to fix it, please contact us — we will appreciate any help.

Okay, so we run the unit tests on different systems and so on. Let's look at a sample unit test. In this test we check the logic of the wrapper around MPI's broadcast function, and the main thing to remember from this slide is that we test this function both in its plain-Python form and JIT-compiled by Numba.
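A rough sketch of such a parametrized test — assuming a bcast(data, root) wrapper and that, as with send, the numba-mpi functions can be called from @njit-compiled code — might look like this, to be run under mpiexec:

```python
# Sketch of a parametrized unit test: the broadcast wrapper is exercised
# both when called from plain Python and from @njit-compiled code
# (the bcast(data, root) signature is assumed from the documentation).
import numba
import numpy as np
import pytest
import numba_mpi as mpi

@numba.njit()
def bcast_njit(data, root):
    return mpi.bcast(data, root)   # call from inside LLVM-compiled code

@pytest.mark.parametrize("bcast", (mpi.bcast, bcast_njit))
def test_bcast(bcast):
    data = np.empty(3, dtype=np.float64)
    if mpi.rank() == 0:
        data[:] = (1.0, 2.0, 3.0)  # only the root rank fills the buffer
    bcast(data, 0)                 # broadcast from rank 0 to all ranks
    np.testing.assert_array_equal(data, (1.0, 2.0, 3.0))
```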
We have also set up an integration test, which lives in a separate project; this is just a scheme of the test. We start by providing the initial conditions for a PDE solver, and these initial conditions are written to an HDF5 file. After that we do three runs: the first with only one process, the second with two processes and the third with three. In the first one we do not divide the domain, while in the other ones we divide the domain accordingly. In the assert stage we simply compare the results — which are also written to HDF5 files — and require them to be the same across the different runs.

An interesting fact: everything works on Windows except installing the HDF5 package with concurrent file access enabled — h5py with MPI-IO — which we had trouble setting up on Windows; everything else works fine.

There is also an independent use case: the py-pde project, which uses our package and is not developed by us, so there is an external user. py-pde is a Python package for solving partial differential equations; it focuses on finite differences, and the PDEs are provided as strings. The solution strategy is as follows: the grid is partitioned across the nodes using numba-mpi, the partial-derivative expressions are compiled using SymPy and Numba, and then the PDE is iterated in time, exchanging boundary information between the nodes using numba-mpi.

Take-home messages. There is a common mismatch between the Python language and the Python ecosystem: the language itself may be slow, but we should also consider the ecosystem around it — the libraries that are available and the different implementations. Python has a range of HPC solutions to glue together, such as just-in-time compilation, GPU programming, multi-threading and MPI, and numba-mpi is the package that glues MPI with LLVM-compiled, Numba-jitted Python code. It is tested on CI on GitHub Actions, we are aiming for 100% unit-test coverage, and there are already two projects that depend on the package. Here you can find the links for numba-mpi: the GitHub repository and the packages on PyPI and Anaconda.

We also welcome contributions — for example to the two issues I mentioned earlier. We would also welcome a logo for numba-mpi, as well as support for further MPI functions. We are aiming to drop the dependency on mpi4py in our project, and we also plan to benchmark the performance of the package. Finally, we want to acknowledge the funding: the project was funded by the National Science Centre, Poland. Thank you for your attention, and we probably now have time for questions.
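Returning briefly to the integration test described above, its idea can be sketched roughly as follows — the script name, file names and dataset layout here are hypothetical placeholders, not the ones from the actual project:

```python
# Rough sketch of the integration-test idea: run the same solver under
# 1, 2 and 3 MPI processes and require identical results; "solver.py",
# the file names and the "state" dataset are hypothetical placeholders.
import subprocess
import h5py
import numpy as np

def run(n_processes):
    out = f"result_{n_processes}.h5"
    subprocess.run(
        ["mpiexec", "-n", str(n_processes), "python", "solver.py", out],
        check=True,
    )
    with h5py.File(out, "r") as f:
        return f["state"][:]

reference = run(1)           # single-process run, no domain decomposition
for n in (2, 3):             # decomposed runs must reproduce the reference
    np.testing.assert_array_equal(run(n), reference)
```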
Thank you very much. Any questions? A question from an MPI expert?

Hello, thank you for the talk. The interface you are proposing is very close to, let's say, the C MPI interface — when you do a send, you work with a buffer. Do you also try to provide a somewhat higher-level interface, for example one that serializes Python objects? That could be very useful.

Yeah, the interface is as thin as possible, very close to the C MPI one. One of the reasons is that within Numba-jitted code, things like serialization might not be that easy to do. There is no problem in combining mpi4py and numba-mpi in one code base: outside of the jitted code you can use mpi4py, which has high-level features such as serialization, et cetera, and within the LLVM-compiled blocks you can use numba-mpi for simple send, recv, allreduce — without the higher-level functionality. Having said that, we do, for example, transparently handle non-contiguous slices of arrays, so there are some things that are higher-level than the C interface; but in general we try to provide a wrapper around the C routines.

Okay, thank you.
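A sketch of the combination mentioned in this answer — mpi4py for high-level, object-serializing communication outside the jitted code, numba-mpi inside it; the allreduce(send_buf, recv_buf) signature and its default sum operator are assumptions based on the numba-mpi documentation — could look like this:

```python
# Sketch: numba-mpi inside the @njit-compiled kernel, mpi4py for
# pickle-based, high-level communication outside of it.
import numba
import numpy as np
import numba_mpi
from mpi4py import MPI

@numba.njit()
def jitted_kernel(chunk):
    send_buf = np.empty(1, dtype=np.float64)
    recv_buf = np.empty(1, dtype=np.float64)
    send_buf[0] = chunk.sum()
    numba_mpi.allreduce(send_buf, recv_buf)   # buffer-based, low-level call
    return recv_buf[0]

comm = MPI.COMM_WORLD
local = np.arange(comm.Get_rank() + 3, dtype=np.float64)
total = jitted_kernel(local)

# high-level, serializing communication outside the LLVM-compiled block
gathered = comm.gather({"rank": comm.Get_rank(), "total": total}, root=0)
if comm.Get_rank() == 0:
    print(gathered)
```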
Any other questions?

Thanks for a great talk — what you are working on seems really interesting. I have a couple of questions, probably born out of ignorance, but I wonder if you could help me with them. Firstly, I was wondering why you went with making a separate package rather than trying to build this functionality on top of mpi4py — would it have been possible to add the feature of making things JIT-compilable to mpi4py itself? And secondly, regarding the MPI-IO issue you were looking at on Windows: if that requires concurrent file access from separate processes, is it just a complete no-go on Windows? Because I understand that is something the Windows kernel doesn't support. Thank you.

Thanks — let me start with the second one. Essentially, it is a fun fact that everything else worked on Windows; we do not really target Windows, but it was nice to observe that it all works. It is one of the advantages of Python that you write code and you don't really need to take too much care about the target platforms, because the underlying packages are meant to work on all of them. Here, everything works with Microsoft MPI; the only real problem for us was installing h5py on Windows with MPI support at all. So we don't really know what the true bottleneck is, but even the h5py documentation advises against trying.

As for the first question — why did we develop a separate package instead of adding this on top of mpi4py? On the slide with the story of the package there was a link, in the bottom footnote, to an mpi4py issue where we asked whether it would be possible to add it there. The goal of mpi4py is to provide a very high-level API for MPI in Python, and in the discussion with its developers we realized that this is probably not within the scope of such a high-level interface, so we started off with a small separate project. It is a great idea, though — it could be glued together. As of now we aim to drop the dependency on mpi4py, which we use only for some utility routines, not for the communication and not for anything used from within Numba-compiled code. And that might actually be an advantage, because eventually you should be able to install numba-mpi with very few other dependencies: numba-mpi is written purely in Python, so to install it you do not need to build any C extension code, and you can do it quite easily.

Okay, thank you very much.