[00:00.000 --> 00:10.000] Okay, we're good to get started with one more MPI talk, but I think a very different one [00:10.000 --> 00:11.000] compared to the others. [00:11.000 --> 00:12.000] Hopefully. [00:12.000 --> 00:13.720] Compiler-aided MPI correctness checking. [00:13.720 --> 00:14.720] Yeah. [00:14.720 --> 00:15.720] Thank you.

[00:15.720 --> 00:21.120] So my name is Alexander Hück, and today I'm going to talk about basically the dynamic [00:21.120 --> 00:23.680] MPI correctness tool, which is called MUST. [00:23.680 --> 00:28.200] And in particular, I'm going to talk about the compiler extension, which is called [00:28.200 --> 00:34.800] TypeART, which is supposed to help with MPI type correctness checking.

[00:34.800 --> 00:41.480] And first of all, as we heard before, the message-passing interface is the de facto [00:41.480 --> 00:46.440] standard for distributed computation in the HPC world, right? [00:46.440 --> 00:52.960] And it defines a large set of communication routines and other stuff, and it's also designed [00:52.960 --> 00:59.400] for heterogeneous cluster systems where you have different platforms that communicate [00:59.400 --> 01:01.400] and compute something. [01:01.400 --> 01:07.640] However, in that sense, it's also a very low-level interface where you have to specify a lot [01:07.640 --> 01:13.400] of stuff manually, and you can expect only little error checking in general from the [01:13.400 --> 01:14.880] library itself. [01:14.880 --> 01:22.000] So for a simple MPI send operation, the user is required to specify the data, which [01:22.000 --> 01:25.120] is transferred as a typeless void buffer. [01:25.120 --> 01:32.480] The user has to specify the data length of the buffer and the data type manually, [01:32.480 --> 01:37.800] and also the message envelope, so the destination of the message, the communicator and things [01:37.800 --> 01:39.640] like that, has to be specified manually. [01:39.640 --> 01:45.040] So there's a lot of opportunity to make a mistake, basically.

[01:45.040 --> 01:53.560] And here's a quick question to you guys: if you look at this small code, try to figure [01:53.560 --> 01:58.040] out how many errors you can spot in this small example. [01:58.040 --> 02:02.960] And just try to look at every corner, basically. [02:02.960 --> 02:08.640] And while I'm talking, I can also spoil it for you: I'm going to show you every issue [02:08.640 --> 02:16.760] in this small example in a couple of seconds, so to speak. [02:16.760 --> 02:21.840] When my colleague Joachim first showed it to me, I couldn't find the most simple [02:21.840 --> 02:26.760] one, which was a bit crazy to me. [02:26.760 --> 02:33.440] Sometimes you can't see the forest for the trees.

[02:33.440 --> 02:38.480] OK, so the most basic one: we don't call MPI_Init, right? [02:38.480 --> 02:40.840] In MPI applications, [02:40.840 --> 02:46.720] that's usually the very first call you're supposed to make, where you initialize the MPI environment. [02:46.720 --> 02:51.160] And then likewise, if you look at the end of the program, we do not call MPI_Finalize. [02:51.160 --> 02:54.080] So those are two simple mistakes. [02:54.080 --> 02:57.080] But then in total, we have eight issues. [02:57.080 --> 02:58.960] I don't know how many you found. [02:58.960 --> 03:06.320] And I'm also not going to talk about each one of them, but if you [03:06.320 --> 03:13.720] look at each individual issue, it's quite easy to see that it could happen to you as well.
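The slide's example code is not reproduced in this transcript, so the following is only a hypothetical sketch illustrating a few of the issue classes named here and just below (the missing MPI_Init/MPI_Finalize calls and a receive-receive deadlock); it is not the actual code from the talk.

    #include <mpi.h>

    /* Hypothetical two-rank example, not the slide code from the talk. */
    int main(int argc, char** argv) {
        /* Issue: MPI_Init(&argc, &argv) is never called. */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[4] = {0};
        int other = (rank == 0) ? 1 : 0;  /* assumes exactly two ranks */

        /* Issue: both ranks post a blocking receive first, so each one waits
         * for a message the other rank never gets to send: a receive-receive
         * deadlock. */
        MPI_Recv(buf, 4, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, 4, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);

        /* Issue: MPI_Finalize() is never called. */
        return 0;
    }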
[03:13.720 --> 03:17.000] And those are the pointers on the slide to where they are. [03:17.000 --> 03:22.880] And in particular, I want to talk about the receive-receive deadlock, where, for instance, [03:22.880 --> 03:28.920] two processes wait on each other without being able to continue. [03:28.920 --> 03:33.240] You can argue that all those issues, except maybe the deadlock, could be found by the [03:33.240 --> 03:34.560] MPI library itself. [03:34.560 --> 03:41.120] But typically on HPC systems, the library does not do any checking, for performance reasons. [03:41.120 --> 03:48.800] That's why many of these issues will not be reported; they will maybe cause crashes for unknown reasons [03:48.800 --> 03:53.160] or just produce some strange results.

[03:53.160 --> 04:01.280] Well, that's why the dynamic MPI correctness tool MUST was developed in the past, which [04:01.280 --> 04:09.040] is a tool that checks for issues during runtime and produces reports like this where it finds some [04:09.040 --> 04:10.040] issues. [04:10.040 --> 04:16.200] And this is a report of the deadlock we have seen in the example code, where the message [04:16.200 --> 04:18.320] itself just states that there's a deadlock. [04:18.320 --> 04:26.200] In the bottom left, you can see a wait-for graph, which just shows you which rank waits [04:26.200 --> 04:31.440] for which other rank, causing the deadlock. [04:31.440 --> 04:37.640] This helps you to kind of see where the deadlock occurs and why it occurs. [04:37.640 --> 04:43.680] And also, MUST can produce so-called call stack information, where you can see, beginning [04:43.680 --> 04:49.920] from the main of the program, basically the origin of the deadlock, but this was omitted [04:49.920 --> 04:50.920] here. [04:50.920 --> 04:51.920] Okay.

[04:51.920 --> 05:00.160] So, to facilitate correctness checking for MPI, MUST uses a so-called distributed [05:00.160 --> 05:04.240] agent-based analysis, which means that you have your normal MPI application with four [05:04.240 --> 05:10.320] ranks, four processes that communicate as you would expect, as the user wrote it. [05:10.320 --> 05:17.480] But MUST will also create an analysis network, which helps it to do local analysis and [05:17.480 --> 05:18.920] to do distributed analysis. [05:18.920 --> 05:23.920] If you think about a deadlock, you need information from more than one process to figure out that [05:23.920 --> 05:26.400] a deadlock occurred in your program. [05:26.400 --> 05:33.520] So MUST creates that completely transparently to the user: you would use MPI_COMM_WORLD [05:33.520 --> 05:39.120] and any other communicator as normal, and MUST takes care of creating such a network.

[05:39.120 --> 05:47.480] And also, what's maybe the focus of the talk today is the local analysis, where we look [05:47.480 --> 05:49.960] at process-local checks. [05:49.960 --> 05:54.760] If you think about MPI type correctness of a send operation, you can do a lot of stuff [05:54.760 --> 06:00.240] locally, or should do a lot of stuff locally, and this is the focus. [06:00.240 --> 06:06.200] So, for MPI type correctness, we focus basically on the buffer and the user-specified length [06:06.200 --> 06:12.400] and the user-specified MPI data type today.
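To make concrete the kind of mismatch these local checks are about, here is a small hypothetical send/receive pair (not an example from the talk) whose user-specified MPI datatypes disagree; this is exactly the class of error addressed next.

    #include <mpi.h>

    /* Hypothetical example: the sender describes the message as 4 ints, the
     * matching receive describes it as 4 floats, so the type maps of the two
     * calls do not match. */
    void exchange(int rank) {
        if (rank == 0) {
            int data[4] = {1, 2, 3, 4};
            MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            float data[4];
            /* Type mismatch: MPI_FLOAT on the receive side versus MPI_INT
             * on the send side. */
            MPI_Recv(data, 4, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }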
[06:12.400 --> 06:17.960] MUST can already detect mismatches in, for instance, a send and receive communication [06:17.960 --> 06:26.080] pair, where MUST basically creates a so-called type map: it looks at the user-specified [06:26.080 --> 06:30.600] buffer size and the user-specified data type, and compares it to the corresponding receive [06:30.600 --> 06:31.600] operation. [06:31.600 --> 06:38.320] If there is a mismatch, obviously, there is going to be an issue, and MUST creates a report [06:38.320 --> 06:40.320] about that. [06:40.320 --> 06:44.360] This also, of course, works for collective communications, where you can make sure that [06:44.360 --> 06:53.000] all ranks call, for instance, a broadcast operation with the same data type.

[06:53.000 --> 07:01.320] However, since MUST only intercepts MPI calls in general, it cannot look behind the [07:01.320 --> 07:06.160] interface, like it cannot see what happens in user space, you know. [07:06.160 --> 07:13.640] So we cannot reason about the type of the void buffer data, and this is why we were [07:13.640 --> 07:23.080] motivated to create the tool TypeART, which is something that helps with basically figuring [07:23.080 --> 07:26.960] out what the memory allocation is that you put into your MPI calls. [07:26.960 --> 07:33.440] So, if you look at this small example on the right side, completely process-local, [07:33.440 --> 07:38.040] there is some memory allocation in that example; it's a double buffer that was allocated [07:38.040 --> 07:47.240] by malloc, let's say, and the question now becomes: how can we make sure that the [07:47.240 --> 07:52.760] data buffer, which is a void buffer, fits the user-specified buffer size, so is it really of [07:52.760 --> 07:58.600] length buffer size, and is it also compatible with the MPI_FLOAT type? And of [07:58.600 --> 08:04.720] course, we can already see that between double and MPI_FLOAT there's a type mismatch, but MUST [08:04.720 --> 08:12.880] cannot answer such a question without further tooling, because it just intercepts MPI calls.

[08:12.880 --> 08:19.200] Okay, so just to show you that it's not an academic example, there are two well-known HPC [08:19.200 --> 08:26.600] benchmark codes which have some issues. One was reported in the past by others, where [08:26.600 --> 08:31.880] there's a broadcast operation: it uses a big-int type, which is an alias for a 64-bit data type; [08:31.880 --> 08:38.640] however, the user specified MPI_INT, which is a 32-bit data type, for the broadcast operation, [08:38.640 --> 08:45.080] so there's an obvious mismatch, and that could likely be a problem. And also for MILC, there's [08:45.080 --> 08:52.000] an allreduce operation where the user passes in a struct with two float members, [08:52.000 --> 08:58.840] and it's interpreted as a float array of size two, which is benign, to be honest, but that could [08:58.840 --> 09:03.040] be a portability issue in the future, maybe, you know; depending on the platform, maybe [09:03.040 --> 09:07.480] there's padding, or whatnot, and maybe it's an illegal operation, so this could also be [09:07.480 --> 09:09.760] an issue in the future.
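A minimal sketch of the motivating situation just described might look as follows; the variable names are assumptions, since the slide code is not reproduced in the transcript. The point is that at the call site MPI only sees a void pointer, a count, and MPI_FLOAT, while the allocation's language-level type is double.

    #include <mpi.h>
    #include <stdlib.h>

    void send_data(int buffer_size) {
        /* The language-level element type is double... */
        double* data = malloc(buffer_size * sizeof(double));

        /* ...but the MPI call only receives a typeless pointer plus a count and
         * an MPI datatype. MPI_FLOAT does not match the allocated doubles, yet
         * MUST alone cannot see this, because it only intercepts the MPI call. */
        MPI_Send(data, buffer_size, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

        free(data);
    }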
[09:09.760 --> 09:16.360] Well, from a high-level point of view, how does MUST work? Well, you have your MPI [09:16.360 --> 09:22.120] application, and during runtime, it intercepts all the MPI calls and collects all the state [09:22.120 --> 09:29.880] that is needed for deadlock detection and so on. And we added TypeART, which looks [09:29.880 --> 09:35.560] at all those allocations that are passed to MPI calls for this local analysis of buffers, [09:35.560 --> 09:43.320] and which is the compiler extension based on LLVM. So you compile your code with our extension, [09:43.320 --> 09:48.800] and the extension instruments all allocations, be it stack, be it heap, which are related [09:48.800 --> 09:56.080] to MPI calls. And we also provide a runtime, so during runtime, we get callbacks from the [09:56.080 --> 10:02.000] target application for all allocations and all free operations, so we have a state of the allocated [10:02.000 --> 10:05.960] memory, basically, in the target code. [10:05.960 --> 10:11.160] We also, of course, look at the allocations and parse out their type; a simple case is a [10:11.160 --> 10:17.520] buffer a of double type, more complex cases would be structs or classes. We pass the serialized [10:17.520 --> 10:23.200] type information to our runtime, which then, of course, enables MUST to make queries. [10:23.200 --> 10:29.320] So for instance, for an MPI send operation, we give the TypeART runtime the buffer, the [10:29.320 --> 10:33.960] typeless buffer, and the runtime would return all the necessary type information to ensure [10:33.960 --> 10:38.120] type correctness of those buffer handles. [10:38.120 --> 10:40.960] This is the whole high-level process behind it.

[10:40.960 --> 10:49.440] And then, if you take a look at an example of a memory allocation, a small heap [10:49.440 --> 10:56.280] allocation of a float array (this all happens in LLVM IR, I'm just showing C-like code to [10:56.280 --> 11:03.840] make it easier to understand), we would add such a TypeART alloc callback, where [11:03.840 --> 11:10.120] we need the data pointer, of course, and then we need a so-called type ID, which is just a representation [11:10.120 --> 11:16.440] of what we allocated that is later used for type checking, and of course we need the dynamic [11:16.440 --> 11:23.520] length of the allocated array, to reason about where we are in the memory space, so to speak. [11:23.520 --> 11:28.560] We also handle stack and global allocations: for stack allocations, of course, we have [11:28.560 --> 11:34.040] to respect the automatic, scope-dependent lifetime properties, and for globals we just register [11:34.040 --> 11:40.000] them once and then they exist in our runtime for the whole program duration.
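As an illustration of what such an inserted callback could conceptually look like, here is a C-like sketch; the callback name, its signature, and the concrete type ID value are assumptions made for this sketch, and the real instrumentation operates on LLVM IR rather than on C source.

    #include <stdlib.h>

    /* Assumed callback: pointer to the allocation, a type ID describing the
     * element type, and the dynamic element count. */
    extern void __typeart_alloc(const void* addr, int type_id, size_t count);

    enum { FLOAT_TYPE_ID = 5 };  /* illustrative type ID for 'float' */

    void compute(size_t n) {
        float* data = malloc(n * sizeof(float));
        /* Inserted by the compiler pass right after the allocation. */
        __typeart_alloc(data, FLOAT_TYPE_ID, n);

        /* ... use data ... */

        free(data);  /* a corresponding free callback would be inserted here, too */
    }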
[11:40.000 --> 11:45.920] And of course, for performance reasons, you can imagine that the fewer callbacks, the better; [11:45.920 --> 11:50.240] hence we try to filter out allocations where we can prove that they are never part of an [11:50.240 --> 11:56.840] MPI call, and just never instrument those. [11:56.840 --> 12:06.160] This is basically possible on LLVM IR by data flow analysis: in the function foo we have [12:06.160 --> 12:11.040] two stack allocations, and then we try to follow the data flow, where we can see that a is passed [12:11.040 --> 12:17.480] to bar, and inside bar there's never any MPI call, so we can just say, okay, we do not need [12:17.480 --> 12:21.520] to instrument this; this is discarded. [12:21.520 --> 12:28.160] Likewise for foo_bar, we can see that b is passed to it; if it's in another translation unit, [12:28.160 --> 12:36.280] we would need to have a whole-program view of the program, which we support, but other [12:36.280 --> 12:41.120] tools have to create such a call graph with the required information. [12:41.120 --> 12:48.720] Anyway, if we have this view, we can see foo_bar also does not call MPI, so both [12:48.720 --> 12:54.240] stack allocations don't need to be instrumented, which helps a lot with the performance.

[12:54.240 --> 13:04.400] Okay, so the type ID which is passed to the runtime for identification works as follows: [13:04.400 --> 13:10.000] built-in types are obviously known a priori, so we know the type layout, float is 4 bytes, [13:10.000 --> 13:15.320] double is 8 bytes, depending on the platform, of course. For user-defined types, which means [13:15.320 --> 13:24.160] structs, classes and so on, we basically serialize them to a YAML file, together with the corresponding type [13:24.160 --> 13:30.560] ID of course, so we can match those during runtime. There we have the extent, how many [13:30.560 --> 13:36.360] members, their offsets (byte offsets, basically, from the beginning of the struct), and also the [13:36.360 --> 13:44.000] subtypes are listed, which can then be used for making type queries about the layout and [13:44.000 --> 13:48.520] things like that.

[13:48.520 --> 13:54.840] And then, of course, MUST needs to have some API to figure out type correctness, and this [13:54.840 --> 14:05.080] is provided by our runtime, which has quite a few API functions. The most basic one would [14:05.080 --> 14:12.280] be this typeart_get_type, where you put in the MPI buffer handle, and what you get out [14:12.280 --> 14:18.000] is the type ID and the array length. Then you can use the type ID subsequently, for [14:18.000 --> 14:23.280] instance in this call where you put in the type ID and you get out the struct layout [14:23.280 --> 14:33.440] I just mentioned earlier, and this way you can kind of assemble an iterative type check, [14:33.440 --> 14:38.640] which is what is done in MUST.
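A rough sketch of how such a query sequence might be used is shown below; the function name, signature, and status convention are assumptions based on the description above, not the documented TypeART API.

    #include <stddef.h>

    /* Assumed runtime query: resolve an address to a type ID and the number of
     * elements allocated at that address; returns non-zero if unknown. */
    extern int typeart_get_type(const void* addr, int* type_id, size_t* count);

    int check_send_buffer(const void* mpi_buffer, size_t mpi_count) {
        int type_id;
        size_t count;

        if (typeart_get_type(mpi_buffer, &type_id, &count) != 0) {
            return 0;  /* unknown allocation, nothing we can verify */
        }

        /* A checker like MUST would now compare type_id against the type map of
         * the MPI datatype passed by the user (resolving struct layouts through
         * further queries) and verify that mpi_count fits the allocation. */
        return mpi_count <= count;
    }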
[14:38.640 --> 14:46.520] And then, putting it all together: if you want to use our tooling, you would need to first [14:46.520 --> 14:55.520] of all compile your program with our provided compiler wrapper, which is a bash script and [14:55.520 --> 15:00.920] does the bookkeeping required to introduce the instrumentation, the TypeART stuff. So [15:00.920 --> 15:04.880] you exchange your compiler; that's the first step. It's optional, you don't have to do [15:04.880 --> 15:11.640] it if you don't need this local TypeART checking. And then you would also need to replace your [15:11.640 --> 15:18.240] mpiexec or mpirun, depending on the system, with mustrun, which also does some bookkeeping [15:18.240 --> 15:27.680] for MUST to execute the target code appropriately, spawn all the agent-based analysis networking [15:27.680 --> 15:35.760] and so on. And then the program runs as normal, and a MUST output file is generated with all the [15:35.760 --> 15:45.080] issues found during the execution of your program. And as a side note, maybe: as I said, MUST creates [15:45.080 --> 15:49.480] this agent-based network, and in the most simple case for the distributed analysis, there's [15:49.480 --> 15:57.320] an additional process needed for the deadlock detection and so on. So for SLURM or whatnot, [15:57.320 --> 16:03.840] you need to allocate an additional process; however, you don't need to specify it in the [16:03.840 --> 16:11.160] mustrun invocation, it happens automatically in the background.

[16:11.160 --> 16:17.600] Alright, so that's it. If you look now at what the impact of our tooling is, well, that's quite dependent, as I [16:17.600 --> 16:23.000] kind of alluded to, on how many callbacks you have, how many memory allocations you actually [16:23.000 --> 16:29.520] have to track, and how good we are at filtering them. So here are two examples, LULESH and [16:29.520 --> 16:36.720] Tachyon, which are again quite well-known HPC benchmarking codes, and LULESH is quite [16:36.720 --> 16:44.520] favorable for our presentation because there are not many callbacks, and hence our runtime impact [16:44.520 --> 16:53.200] is quite non-existent, so to speak. You can see that, compared to vanilla, [16:53.200 --> 17:00.920] without any instrumentation, without our tooling, TypeART alone has almost no impact, and then with [17:00.920 --> 17:09.200] the TypeART analysis enabled there is almost no additional impact. For Tachyon the picture looks quite [17:09.200 --> 17:15.120] different, as you can see: there's an overhead factor of about three when you introduce [17:15.120 --> 17:21.840] TypeART. This is because there are a lot of stack allocations that we cannot filter, so we [17:21.840 --> 17:29.120] track a lot of stack allocations and the runtime impact is quite high. And this is reflected [17:29.120 --> 17:36.400] by those runtime and static instrumentation numbers. So first of all, the table here [17:36.400 --> 17:43.960] shows you what we instrument during compilation: you can see that there are some heap and free [17:43.960 --> 17:49.040] operations that we find and instrument, and there are some stack allocations and globals that we [17:49.040 --> 17:57.000] instrument. Of course, those numbers do not represent the runtime numbers, because [17:57.000 --> 18:02.840] heap and free operations are sometimes written in a way that they are centralized in [18:02.840 --> 18:13.160] a program; that's why those numbers are not as high as you would expect. For stack allocations, [18:13.160 --> 18:20.480] we find 54, and out of those
54 we can filter, for LULESH at least, 21%; and globals are much [18:20.480 --> 18:25.040] easier to follow along the data flow in LLVM IR, so we can filter much more and much more [18:25.040 --> 18:32.680] effectively. Going to the runtime numbers, which means those are basically the [18:32.680 --> 18:40.400] number of callbacks that happen during our benchmarking, we can already see that the [18:40.400 --> 18:50.360] high overhead which we observed for Tachyon is explained by the almost 80 million [18:50.360 --> 18:54.760] stack allocation callbacks, basically, that we have to track during runtime, which is [18:54.760 --> 19:05.000] a lot of context switching and so on, and which is not good for the runtime.

Alright, so this [19:05.000 --> 19:12.080] is already my conclusion. What we have done is, basically: with TypeART, MUST can now check [19:12.080 --> 19:18.880] all phases of the MPI communication with respect to type correctness. So the first phase, the one that [19:18.880 --> 19:24.360] MUST could already check, is basically the message transfer; this is checked [19:24.360 --> 19:29.320] already. However, there is also the phase of message assembly, right, where you go kind [19:29.320 --> 19:36.200] of from the user program into MPI, and you have to check this, and of course, [19:36.200 --> 19:40.760] if you think about it, you would also have to check the message disassembly, where [19:40.760 --> 19:51.160] you go from the received data back to your user program again. So TypeART enables these kinds [19:51.160 --> 20:05.640] of local checks to ensure type correctness. Thank you very much. [20:05.640 --> 20:22.720] Any questions?

[20:22.720 --> 20:27.400] Yeah, so I really liked the talk, I thought it was really interesting. So one thing I wanted [20:27.400 --> 20:36.800] to ask was: how does one get MUST, like, how do they install it? Is it available from distribution [20:36.800 --> 20:40.520] package managers, or is it more that you have to compile it yourself?

[20:40.520 --> 20:48.080] Good question. I think you have to compile it yourself, even on our HPC system. But [20:48.080 --> 20:56.280] it's not that tedious to compile, I think, maybe I'm biased, but just go to the website [20:56.280 --> 21:03.640] and there's a zip file; it includes every dependency that you need, and I think the documentation [21:03.640 --> 21:09.680] is quite straightforward. You need, of course, maybe Open MPI installed, but not much more, [21:09.680 --> 21:16.040] to be honest, and then you should be good to go. Yeah, I think it's CMake-based, I don't [21:16.040 --> 21:22.840] know if you have problems with that, but yeah, it should be straightforward to try it out.

[21:22.840 --> 21:38.440] Thank you. Another question over there, on my way.

[21:38.440 --> 21:45.480] So, on the type analysis that you do: I mean, if you look at a malloc and it has like a type [21:45.480 --> 21:49.480] cast, then you know what the type is, but if it doesn't have a type cast, if you malloc [21:49.480 --> 21:53.840] into a void pointer and the amount of bytes you are allocating comes from some constant [21:53.840 --> 21:59.480] or macro or some argument, how far do you follow that? And if you can't see it, do you have [21:59.480 --> 22:03.040] a warning, do you crash?
[22:03.040 --> 22:08.440] That's a good question, and that's basically a fundamental problem, right? So we have to [22:08.440 --> 22:16.320] have some expectations of the program, right? So our expectation is that the malloc calls [22:16.320 --> 22:29.080] are typed; otherwise we would just track it as a chunk of bytes. And I think our analysis [22:29.080 --> 22:37.960] is quite forgiving, so we would just say, okay, this is a chunk of bytes, it fits, you [22:37.960 --> 22:40.520] know, the buffer, and this is fine.

[22:40.520 --> 22:53.360] Yes, you kind of lose that, right? If you just know it's a chunk of bytes, then you kind [22:53.360 --> 23:00.880] of lose the alignment checks. Because if you have, say, a malloc of a struct [23:00.880 --> 23:07.720] and then you do some pointer magic for your MPI buffer and you point between members into [23:07.720 --> 23:16.840] the padding area, only if TypeART knows about the malloc'd struct can it, of course, warn that [23:16.840 --> 23:23.040] you are doing some illegal memory operations; if we just see a void pointer due to the typeless [23:23.040 --> 23:38.560] malloc, then we have lost, basically.

Anyone else? Do you have any thoughts on using Rust, [23:38.560 --> 23:44.600] which is a way more memory-safe language than C and C++? Have you looked at it?

[23:44.600 --> 23:53.920] Not really, not yet. For now we have so much to do in the C and C++ world to support [23:53.920 --> 24:01.920] typing better, to get more robustness and so on, so not yet, to be honest.

[24:01.920 --> 24:04.920] Maybe all that work becomes irrelevant if Rust gets popular enough.

[24:04.920 --> 24:12.280] I think in general, maybe I'm completely a newbie when it comes to Rust, but I think the [24:12.280 --> 24:20.360] MPI support itself is still in the works. I read some papers about generating bindings [24:20.360 --> 24:30.360] for MPI which are inherently type-safe; not sure how that goes.

[24:30.360 --> 24:36.300] I think everyone will be happy if Rust or some other type-safe language becomes more [24:36.300 --> 24:43.080] widely used and this kind of work becomes irrelevant, but while people still use C and C++, [24:43.080 --> 24:44.800] this is very relevant.

[24:44.800 --> 24:49.720] That pays my bills. [24:49.720 --> 24:50.720] Thank you very much. [24:50.720 --> 25:07.760] Thank you.