We want to stick to time. We've been doing very well, so we want to keep that up. The next talk is about, oh sorry, your name is on the slide, that doesn't help me. Tim, yes, the next talk is by Tim, about extracting mini-apps from HPC software.

Thank you for having me back at FOSDEM. I am Tim. By trade I'm a compiler guy, and I'm giving a talk in the HPC dev room; how that came to be will be part of the talk: extracting mini-apps from HPC software for total-cost-of-ownership-optimized system procurement. I want to give some quick background on how this project came to be before I actually go into the technical details, because this project is part of the TCO, so total cost of ownership, project funded by the NHR association. NHR stands for Nationales Hochleistungsrechnen, basically the German term for national high performance computing. The NHR is an alliance of computing centers which all have different specialties, but which share a common admission process and a so-called harmonized computing environment. That basically means that all the clusters at the different locations have very similar scheduling systems and very similar file systems, so if you manage to get your program working on one of the HPC systems, you will probably be able to get it working on the other systems as well. And this little blue dot is Darmstadt, which is where I'm from.

All those systems are procured at some point, and whenever you do hardware procurement there is basically the question: what hardware do you get? The simple answer is you get the best of everything, right? The most cores, the fastest RAM, the most efficient power delivery units, the fastest and largest storage. But this is infeasible for all but the largest HPC computing centers, and even those usually struggle. So usually you go ahead and say, well, we want to get the best performance per dollar we spend. And for this you need to figure out what performance you get for those dollars. So you usually use LINPACK and STREAM, which are benchmarks: one basically stresses your floating-point units and turns your rack into a space heater, the other stresses memory. And because you do not want to run only synthetic benchmarks, you take some benchmarks from the SPEC HPC suite, run those, and figure out roughly how much performance you get. Recently there has also been a push in the HPC community to become more power efficient, and there the target to hit is performance per watt. And there, again, you basically use LINPACK: if you look at the Green500 list of HPC systems, they all publish their LINPACK score.

But this is not actually representative of what the system will cost during its lifetime; it is usually just your one-time investment cost for procuring the actual hardware. What you really want to figure out, especially in the case of this distributed national high performance computing association, is: is the money we spend actually well invested for the use cases our users have? And this is where this total cost of ownership project came to be. Because you do not want to score your procurement only on performance; you want it to be a mix of different factors. Of course you want the initial hardware and software investment cost as part of this, but you also want to factor in cooling costs, because this is one of the main cost drivers today.
You put power in, and you have to get the heat back out, dissipating the heat you generate. You usually also want technical and administrative staff for your HPC system to actually work properly. And then the last factor is power consumption. And it's not the power consumption of your idling system, because that is reasonably low, but that of the job mix you're running.

This job mix is essential in this whole thing, because the job mix is a very user-dependent metric: it is what the system is actually being used for. For example, and this again refers back to these distributed and slightly specialized computing centers we have in the NHR, if you do physics simulation, your application might benefit more from faster CPUs with higher core counts, compared to AI workloads, where you probably cannot get enough accelerator cards. So what you do is you monitor how your system is used, which is doable; Slurm can do this, for example. And then you figure out that your users are running LAMMPS and GROMACS and OpenFOAM, your typical HPC software. But you cannot really hand that over to the vendors, can you? Because if you give a big GROMACS run to your hardware vendor and say, well, you have 48 hours to run this code through, then your vendor will probably not do it. And in an even more extreme case, you have this weird institute, like the scientific computing institute where I'm from, which runs one weird a.out executable they self-compiled with a custom build script. You cannot give those to the vendors either. The problem is that all those HPC applications are large and complex and have different coding and software patterns, but they are the most representative thing you can get about what is actually running on your clusters.

So the idea is: if you have some very big and complex HPC application, like the one sketched on the right, which has some kind of entry point, then does matrix conditioning and solving, and then heavy output preparation, the part that actually consumes most of the compute cycles is the one in gray, the matrix conditioning and solving. If you have a so-called mini-app, which is just the gray part and not all the other things around it, you might be able to shrink this application significantly. This mini-app approach was in part pioneered by Jan-Patrick Lehr, who happens to be the guy who gave the talk before me, so talk about coincidence. The basic idea is that you shrink the size of your application but keep the computational characteristics: the computational kernel, where most of your compute cycles are actually spent, stays the same. Then you just need to add some wrapper function that sets the kernel up, and, to finish, you need some way to gracefully terminate the program. (I'll show a tiny sketch of such a skeleton in a moment.) You can then do time measurements and power measurements on this little part of the actual big program. Great.

And this is why they needed the compiler guy to do this, because in this total cost of ownership project the idea was to have a fully automatic extraction pipeline. The basic idea for this pipeline was: first you analyze the whole program. For this we used the MetaCG framework. Those of you who happened to be there last year when I gave a completely different talk saw that we were using MetaCG there as well; it's a tool that's used quite heavily at our institute and gives you a representation of how functions relate to each other over the whole program.
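Here is the tiny skeleton I promised, just to make the mini-app structure concrete before we go deeper into the pipeline. This is only an illustration, not something our tooling generates: the names setup_inputs and solve_kernel are stand-ins for whatever your real setup and kernel code are, and the kernel here is just a dense matrix-vector product.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Stand-in for the extracted computational kernel (illustrative only):
    // a dense matrix-vector product, y = A * x.
    static void solve_kernel(const std::vector<double>& A, const std::vector<double>& x,
                             std::vector<double>& y, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j)
          sum += A[i * n + j] * x[j];
        y[i] = sum;
      }
    }

    // Wrapper that sets the kernel up with representative input data.
    static std::size_t setup_inputs(std::vector<double>& A, std::vector<double>& x,
                                    std::vector<double>& y) {
      const std::size_t n = 2048;  // a real mini-app would read a representative problem size
      A.assign(n * n, 1.0);
      x.assign(n, 2.0);
      y.assign(n, 0.0);
      return n;
    }

    int main() {
      std::vector<double> A, x, y;
      const std::size_t n = setup_inputs(A, x, y);

      const auto start = std::chrono::steady_clock::now();
      solve_kernel(A, x, y, n);  // the part you actually measure
      const auto stop = std::chrono::steady_clock::now();

      std::printf("kernel time: %.3f ms, y[0] = %f\n",
                  std::chrono::duration<double, std::milli>(stop - start).count(), y[0]);
      return 0;  // graceful termination instead of the original program's long tail
    }

The point is simply that everything outside the timed call exists to set the kernel up and to shut down cleanly, so time and power measurements apply to the kernel alone. Now, back to the pipeline and the whole-program analysis.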
So from MetaCG you can get a whole-program call graph. Once you know how all those functions relate to one another, you can figure out what the actual kernel is, so where your compute cycles are spent. For this, the intention was to use PIRA. The other option is to just ask a domain expert what the slow part of the program is, and they will probably tell you; that is usually much easier. And then comes the actual extraction of the kernel. For this we developed the Apex tool, the app extraction tool. It is a Clang-frontend-based compiler tool that does source code manipulation, and the basic idea is that you query the so-called AST. You do not need to know exactly what an AST is or how you get one; the only thing you need to know is that an AST is a very condensed, information-dense representation of a single .cpp file. So if you have this small .cpp file, which only contains the main function, you get the AST shown next to it, admittedly very much shortened, where you can find your record declarations, your structs, your assignments, and your function calls.

What you then do is query this AST to figure out how these functions behave. So if you want to extract the kernel, that is, you already know which part of the program you want to extract, you find all the functions that are used by this kernel: your kernel might call some subroutines, and you want to extract all those subroutines as well. Sadly, the AST cannot always provide what we need, because we can only extract a function once we have its definition, so the body of the function. And as we are limited to one .cpp file at a time with our AST information, in this case we only have the declaration, which was part of the header. So the print function in our example is only declared, not defined; we have no body there. If we have the whole body of a function, we take it as a whole text block; if we only have the declaration, we remember that we still need it, and we extract it once we actually find the source file that contains the definition of this function.

What we can also do is find and extract all the used globals, because you usually rely on some kind of struct definition, and you might even be using global state. Here the AST does have the information: the whole definition of our struct S was inside a header file, and header files are pulled in by the preprocessor, so they are already part of the translation unit when the AST is built. Great, so we just extract those as a text block. Then we need to find all include statements, because the include is the last colored thing in our example, and here we run into a little problem. Remember how I told you it's great that the preprocessor puts the header files into our source code? Well, include statements are also handled by the preprocessor. So everything specified in the included header is physically present in the AST once it is built, and the include directive itself is not there anymore. The same is true for #defines, #ifdefs, and pragmas: all of those are resolved by the preprocessor, and we have no good way to recover them once we are at the AST level. So we do not only need to hook into the AST, we also need to hook into the preprocessor. Those of you who have actually worked with a preprocessor might know that the preprocessor is basically doing copy and paste: it is not context sensitive, it just takes include files and puts them where the include statement was.
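To give a feeling for how little information we actually work with here, this is roughly what the bookkeeping boils down to; a self-contained illustration, not the actual Apex code. Every piece we want to carry over, a function body, a struct definition, a directive, is in the end just a file plus a line range, and "extraction" means slicing those lines out of the original source:

    #include <cstddef>
    #include <cstdio>
    #include <sstream>
    #include <string>

    // A piece of source we want to carry over into the mini-app: just a file
    // plus an inclusive line range. (Illustrative only, not Apex's real data model.)
    struct SourceRange {
      std::string file;
      std::size_t beginLine;
      std::size_t endLine;
    };

    // "Extraction" of a text block: slice the lines of the range out of the file contents.
    std::string extractLines(const std::string& fileContents, const SourceRange& r) {
      std::istringstream in(fileContents);
      std::string line, out;
      for (std::size_t n = 1; std::getline(in, line); ++n)
        if (n >= r.beginLine && n <= r.endLine)
          out += line + "\n";
      return out;
    }

    // Does range a fully enclose range b? This is the kind of check needed to decide
    // that a preprocessor guard wrapping an include must be kept together with it,
    // or that a directive sits inside a function we are extracting.
    bool encloses(const SourceRange& a, const SourceRange& b) {
      return a.file == b.file && a.beginLine <= b.beginLine && b.endLine <= a.endLine;
    }

    int main() {
      const std::string src = "int x;\nint f() {\n  return x;\n}\n";
      SourceRange fn{"example.cpp", 2, 4};      // the function f, lines 2 to 4
      SourceRange global{"example.cpp", 1, 1};  // the global it uses

      std::printf("%s", extractLines(src, global).c_str());
      std::printf("%s", extractLines(src, fn).c_str());
      std::printf("global inside function? %s\n", encloses(fn, global) ? "yes" : "no");
      return 0;
    }

The encloses-style check at the bottom is exactly what we will need in a second, when the line ranges reported by the preprocessor have to be matched against the line ranges of functions in the AST.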
So we somehow need to map these context-insensitive analysis results we get from our preprocessor hooks onto the context-sensitive information we get from the AST. And the only thing the two have in common is source file locations: the preprocessor knows which source file and line it is currently at while it does its copy-pasting, and the AST can be mapped back to the original source file. So, going to a more realistic example, this is an excerpt from the LULESH code. It is heavily shortened, so there is no way to figure out exactly what it does, but we are mostly interested in the parts colored yellow: it starts with an "if OpenMP is available" preprocessor guard and then includes the actual OpenMP header. The preprocessor gives us all this information: whenever it encounters one of these statements, we get a callback that tells us, we found an OpenMP guard, it goes from line one to line three, and we found an include statement on line two. Then it is on us to figure out that lines one to three fully encompass the statement we found on line two, because, again, we are only doing text-block extraction. That is the matching that happens purely within the preprocessor information. But going on, the preprocessor also tells us that it found OpenMP guards again on lines six to 13, seven to nine, and 15, while we also know from the AST that there is a whole function going from line five to line 20. So we need to marry those two pieces of information together as well, and this matching process was one of the challenges we had to overcome for the kernel extraction.

When we started this whole process, we had a very good idea of how to handle single-translation-unit C code, and we expanded on this to allow for multi-translation-unit C code and C with C++ components, such as new and delete and classes. We are currently working on getting codes that make heavy use of templates to work. The problem with templates is that, if you think back to our analysis step, we only get information about functions and how those functions relate to one another, and templates are, compiler-speaking, not necessarily functions; they are descriptions of how functions will be generated at compile time. So our analysis currently does not give us information about the original template, only about the instantiated templates as generated by the compiler. And if you think back to the global-usage analysis we are doing: if you have complex class inheritance and polymorphism, we are currently not able to traverse all the possible diamond-inheritance hierarchies that are possible in C++. Lastly, the idea is to also allow for automatic checkpointing, so that the wrapper calls needed to set up the environment for the kernel to run can be generated; it is theoretically possible to generate those wrapper calls fully automatically, we just have not looked into it yet. And the one thing we are very skeptical we will ever be able to do is to extract mini-apps from every C++ code ever written, because there are so many things you can do in C++. We can try to get there, but I am very skeptical we ever will. But I do not want to leave you on this somewhat depressing note. Even in a state like this, where the tool cannot fully handle all templates, and where it cannot handle the most complex
inheritance hierarchies, tool-assisted mini-app extraction can still be useful. For example, if you are willing to include the templates manually, because your program will not compile without them, you can just copy-paste them in, and then you can get mini-app extraction to work right now. And if you are interested in doing pinpoint optimizations on your source code, you can extract only those small snippets of code you actually intend to work on, optimize them manually, and then reintegrate them easily. So there are uses even for a tool that is not able to handle every C++ code ever written. And if you know of any HPC code that you think has an identifiable kernel and maybe does not use the deepest inheritance hierarchies, let me know, because I am always interested in figuring out how well my tool performs on other codes. With this, thank you for your attention, I hope it was somewhat interesting, and I am open for questions.

Any questions?

I am the author of a large sparse matrix library, do you have something similar already in your catalog or collection?

So it would be interesting to apply this tool to a library, but usually what we are doing is: we have the whole HPC software, and then you call into the library. Of course your library is probably doing the heavy lifting and therefore probably contains the kernel part of the program, and we can look into extracting that. But extracting a call to a library is, relatively speaking, very easy, so programs whose basic structure is "do some setup, call an HPC library, get the result back" are basically already mini-apps in the sense we are talking about, because they do not do most of the heavy computing themselves. But if your library has internal conditioning or matrix-solving capabilities that you know it struggles with, then we are talking again, so just let me know the name of your library and I will try to look into it.

More questions? Alex was first.

Hi. I was wondering, your mini-apps seem to be focused on compute-intensive parts of the code; do you also construct mini-apps for storage-intensive applications or something else?

So the automatic identification via the PIRA tool tries to figure out what the compute-intensive part of the program is, and this is the only kernel, so to speak, that we are able to identify automatically. But if you as a domain expert know that this is the part of the program where we are IO limited, which is nothing PIRA can identify, and you say, I want to extract starting from this function, then our tool should in theory be able to extract the IO-limited part of your code. So this is the point where you as a domain expert need to specify: this is the part I want to extract, because the only thing that is very prominently identifiable is the compute-limited parts of the program. But yes, in theory it should work.

And a second question, if I may: do you have a library of mini-apps that is ready to be used by third parties?

So currently we are doing our benchmarks on already existing mini-apps, so we are doing mini-app extraction from mini-apps, because I am profiting from the small size of those mini-apps to validate that my program actually runs. For example, the LULESH example I showed is a mini-app in itself.
If I remember correctly, it's a shock simulation in fluids, please don't quote me on that, but it's a great code and it's very easy to work with, so I'm using it for my evaluation. The idea, though, is to get it to work on larger codes; for example, we're currently looking into ISSM, the Ice-sheet and Sea-level System Model, which is an ice-melting simulation for large ice sheets. But yes, we're always looking for other codes, and if you have something that is IO bound then of course tell me. Thank you.

Plenty of time for more questions. Chris?

Sorry to be that guy, but how hard would it be to adapt this approach to Fortran?

Fortran tooling in general is something that has been of interest at our institute for a long time, but the problem with Fortran tooling is that most of our knowledge comes from the Clang front end, so the C-language front end, and I am not sure if the current new Fortran LLVM front end offers the same analysis and query capabilities as Clang. The problem with moving lower in the hierarchy towards LLVM IR, which is more target agnostic, or really more language agnostic, is that as soon as you go down to IR it is very hard to go back and figure out which source code files actually made up this IR. So yes, we already have in the back of our minds that there are other languages used on HPC systems, and usually when I present this approach I get asked, well, we have some Fortran codes, we have some Python codes, how does your approach work there? But we are sadly limited by the Clang front end's capabilities: so C, C++, and Objective-C, which we have never tried, I'm not saying we can do Objective-C, but C and C++ we have tried and are currently limited to, because of our design choice.

There is a Fortran front end for Clang.

Yes, but that is a Fortran front end for the LLVM infrastructure. The Clang front end is the part of the LLVM project which takes C code and translates it to LLVM IR, so I'm not sure we're talking about the same thing. There is a Fortran front end for LLVM, yes, but I would be very surprised if there were a way to translate Fortran code into something that Clang can understand. I might be wrong, there is a myriad of interesting software repositories out there, but currently consider me skeptical.

Then you can use this small part of your code to stress only the floating-point units of your hardware. And this other part is very well vectorized, so we extract that one, and suddenly you are able to use AVX-512 instructions. The idea is to extract every compute-intensive part into its own little package and then benchmark with those separate packages, so in the end you get an idea of the performance of your individual kernel regions and of the whole. That would be the approach I would take. Maybe one more? In the back, okay.

Great talk, thank you so much. A couple of comments and a question. I believe there was a paper about a Fortran mini-app extractor some time ago; I can dig that up and send it to you if I remember, and if not, shoot me a message. Also, the Flang front end in the LLVM project is currently not compatible with Clang, so we do get different ASTs. This approach is actually working at the AST level, so if it were working at the LLVM IR level and then somehow magically mapping back via DWARF, it could work; there was a different project that did that, and it worked kind of okay. And I would have a question about the complex inheritance hierarchies.
Do you have any idea how you could tackle that, how to represent this across the whole program? I mean, have you spent any time looking at that so far, or did you say, okay, that's for future me or future someone to do?

Thank you for the question. It's a mix of both. I spent some time thinking about it and decided it was for future me, because I did not expect it to be very easy. But you already mentioned the general idea: as soon as you go into more complex inheritance chains, you cannot extract everything from one header file per se, so you need to do the same opportunistic extraction idea that we do for functions, but now for classes, structs, and all their possible inheritance parents. This is something we need to analyze at whole-program scale, so it is something that, not in the near future but in the foreseeable future, I intend to add as an analysis pass to the CGCollector, which you are probably familiar with. The idea is that this tool then annotates all that information as metadata, and once we merge it, we hopefully get a very good impression of how those inheritance chains flow through the whole program. So that's the idea. Thanks.

Okay, that's all we have time for. Thanks a lot, Tim.