We want to stick to time. We've been doing very well, so we want to keep that up. The next talk is about, oh sorry, your name is on the slide, that doesn't help me. Tim, yes, the next talk is by Tim, about extracting mini-apps from HPC software.

Thank you for having me back at FOSDEM. I am Tim. By trade I'm a compiler guy, and I'm giving a talk in the HPC dev room; how that came to be will be part of the talk: extracting mini-apps from HPC software for total-cost-of-ownership-optimized system procurement. I want to give some quick background on how this project came to be before I actually go into the technical details, because this project is part of the TCO, so total cost of ownership, project funded by the NHR association. NHR stands for Nationales Hochleistungsrechnen, basically the German term for national high performance computing. The NHR is an alliance of computing centers which all have different specialties, but which share a common admission process and a so-called harmonized computing environment. That basically means that all the clusters at the different locations have very similar scheduling systems and very similar file systems, so if you manage to get your program working on one of the HPC systems, you will probably be able to get it working on the other systems as well. And this little blue dot is Darmstadt, which is where I'm from.

All those systems are procured at some point, and whenever you do hardware procurement there is basically the question: what hardware do you get? The simple answer is you get the best of everything, right? The most cores, the fastest RAM, the most efficient power delivery units, the fastest and largest storage. But this is infeasible for all but the largest HPC computing centers, and even those usually struggle. So usually you go ahead and say, well, we want to get the best performance per dollar we spend. And for this you need to figure out what performance you get for those dollars. So you usually use LINPACK and STREAM, which are benchmarks: one basically stresses your floating-point units and turns your rack into a space heater, the other stresses memory. And because you do not want to run only synthetic benchmarks, you take some benchmarks from the SPEC HPC suite, run those, and figure out roughly how much performance you get. Recently there has also been a push in the HPC community to become more power efficient, and there the target to hit is performance per watt. And there, again, you basically use LINPACK: if you look at the Green500 list of HPC systems, they all publish their LINPACK score.

But this is not actually representative of what the system will cost during its lifetime; it is usually just your one-time investment cost for procuring the actual hardware. What you really want to figure out, especially in the case of this distributed national high performance computing association, is: is the money we spend actually well invested for the use cases our users have? And this is where this total cost of ownership project came to be. Because you do not want to score your procurement only on performance; you want it to be a mix of different factors. Of course you want the initial hardware and software investment cost as part of this, but you also want to factor in cooling costs, because this is one of the main cost drivers today.
You put power in, and you have to get the heat back out, dissipating the heat you generate. You usually also want technical and administrative staff for your HPC system to actually work properly. And then the last factor is power consumption. And it's not the power consumption of your idling system, because that is reasonably low, but that of the job mix you're running.

This job mix is essential in this whole thing, because the job mix is a very user-dependent metric: it is what the system is actually being used for. For example, and this again refers back to these distributed and slightly specialized computing centers we have in the NHR, if you do physics simulation, your application might benefit more from faster CPUs with higher core counts, compared to AI workloads, where you probably cannot get enough accelerator cards. So what you do is you monitor how your system is used, which is doable; Slurm can do this, for example. And then you figure out that your users are running LAMMPS and GROMACS and OpenFOAM, your typical HPC software. But you cannot really hand that over to the vendors, can you? Because if you give a big GROMACS run to your hardware vendor and say, well, you have 48 hours to run this code through, then your vendor will probably not do it. And in an even more extreme case, you have this weird institute, like the scientific computing institute where I'm from, which runs one weird a.out executable they self-compiled with a custom build script. You cannot give those to the vendors either. The problem is that all those HPC applications are large and complex and have different coding and software patterns, but they are the most representative thing you can get about what is actually running on your clusters.

So the idea is: if you have some very big and complex HPC application, like the one sketched on the right, which has some kind of entry point, then does matrix conditioning and solving, and then heavy output preparation, the part that actually consumes most of the compute cycles is the one in gray, the matrix conditioning and solving. If you have a so-called mini-app, which is just the gray part and not all the other things around it, you might be able to shrink this application significantly. This mini-app approach was in part pioneered by Jan-Patrick Lehr, who happens to be the guy who gave the talk before me, so talk about coincidence. The basic idea is that you shrink the size of your application but keep the computational characteristics: the computational kernel, where most of your compute cycles are actually spent, stays the same. Then you just need to add some wrapper function that sets the kernel up, and, to finish, you need some way to gracefully terminate the program. (I'll show a tiny sketch of such a skeleton in a moment.) You can then do time measurements and power measurements on this little part of the actual big program. Great.

And this is why they needed the compiler guy to do this, because in this total cost of ownership project the idea was to have a fully automatic extraction pipeline. The basic idea for this pipeline was: first you analyze the whole program. For this we used the MetaCG framework. Those of you who happened to be there last year when I gave a completely different talk saw that we were using MetaCG there as well; it's a tool that's used quite heavily at our institute and gives you a representation of how functions relate to each other over the whole program.
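Here is the tiny skeleton I promised, just to make the mini-app structure concrete before we go deeper into the pipeline. This is only an illustration, not something our tooling generates: the names setup_inputs and solve_kernel are stand-ins for whatever your real setup and kernel code are, and the kernel here is just a dense matrix-vector product.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Stand-in for the extracted computational kernel (illustrative only):
    // a dense matrix-vector product, y = A * x.
    static void solve_kernel(const std::vector<double>& A, const std::vector<double>& x,
                             std::vector<double>& y, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j)
          sum += A[i * n + j] * x[j];
        y[i] = sum;
      }
    }

    // Wrapper that sets the kernel up with representative input data.
    static std::size_t setup_inputs(std::vector<double>& A, std::vector<double>& x,
                                    std::vector<double>& y) {
      const std::size_t n = 2048;  // a real mini-app would read a representative problem size
      A.assign(n * n, 1.0);
      x.assign(n, 2.0);
      y.assign(n, 0.0);
      return n;
    }

    int main() {
      std::vector<double> A, x, y;
      const std::size_t n = setup_inputs(A, x, y);

      const auto start = std::chrono::steady_clock::now();
      solve_kernel(A, x, y, n);  // the part you actually measure
      const auto stop = std::chrono::steady_clock::now();

      std::printf("kernel time: %.3f ms, y[0] = %f\n",
                  std::chrono::duration<double, std::milli>(stop - start).count(), y[0]);
      return 0;  // graceful termination instead of the original program's long tail
    }

The point is simply that everything outside the timed call exists to set the kernel up and to shut down cleanly, so time and power measurements apply to the kernel alone. Now, back to the pipeline and the whole-program analysis.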
So from MetaCG you can get a whole-program call graph. Once you know how all those functions relate to one another, you can figure out what the actual kernel is, so where your compute cycles are spent. For this, the intention was to use PIRA. The other option is to just ask a domain expert what the slow part of the program is, and they will probably tell you; that is usually much easier. And then comes the actual extraction of the kernel. For this we developed the Apex tool, the app extraction tool. It is a Clang-frontend-based compiler tool that does source code manipulation, and the basic idea is that you query the so-called AST. You do not need to know exactly what an AST is or how you get one; the only thing you need to know is that an AST is a very condensed, information-dense representation of a single .cpp file. So if you have this small .cpp file, which only contains the main function, you get the AST shown next to it, admittedly very much shortened, where you can find your record declarations, your structs, your assignments, and your function calls.

What you then do is query this AST to figure out how these functions behave. So if you want to extract the kernel, that is, you already know which part of the program you want to extract, you find all the functions that are used by this kernel: your kernel might call some subroutines, and you want to extract all those subroutines as well. Sadly, the AST cannot always provide what we need, because we can only extract a function once we have its definition, so the body of the function. And as we are limited to one .cpp file at a time with our AST information, in this case we only have the declaration, which was part of the header. So the print function in our example is only declared, not defined; we have no body there. If we have the whole body of a function, we take it as a whole text block; if we only have the declaration, we remember that we still need it, and we extract it once we actually find the source file that contains the definition of this function.

What we can also do is find and extract all the used globals, because you usually rely on some kind of struct definition, and you might even be using global state. Here the AST does have the information: the whole definition of our struct S was inside a header file, and header files are pulled in by the preprocessor, so they are already part of the translation unit when the AST is built. Great, so we just extract those as a text block. Then we need to find all include statements, because the include is the last colored thing in our example, and here we run into a little problem. Remember how I told you it's great that the preprocessor puts the header files into our source code? Well, include statements are also handled by the preprocessor. So everything specified in the included header is physically present in the AST once it is built, and the include directive itself is not there anymore. The same is true for #defines, #ifdefs, and pragmas: all of those are resolved by the preprocessor, and we have no good way to recover them once we are at the AST level. So we do not only need to hook into the AST, we also need to hook into the preprocessor. Those of you who have actually worked with a preprocessor might know that the preprocessor is basically doing copy and paste: it is not context sensitive, it just takes include files and puts them where the include statement was.
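To give a feeling for how little information we actually work with here, this is roughly what the bookkeeping boils down to; a self-contained illustration, not the actual Apex code. Every piece we want to carry over, a function body, a struct definition, a directive, is in the end just a file plus a line range, and "extraction" means slicing those lines out of the original source:

    #include <cstddef>
    #include <cstdio>
    #include <sstream>
    #include <string>

    // A piece of source we want to carry over into the mini-app: just a file
    // plus an inclusive line range. (Illustrative only, not Apex's real data model.)
    struct SourceRange {
      std::string file;
      std::size_t beginLine;
      std::size_t endLine;
    };

    // "Extraction" of a text block: slice the lines of the range out of the file contents.
    std::string extractLines(const std::string& fileContents, const SourceRange& r) {
      std::istringstream in(fileContents);
      std::string line, out;
      for (std::size_t n = 1; std::getline(in, line); ++n)
        if (n >= r.beginLine && n <= r.endLine)
          out += line + "\n";
      return out;
    }

    // Does range a fully enclose range b? This is the kind of check needed to decide
    // that a preprocessor guard wrapping an include must be kept together with it,
    // or that a directive sits inside a function we are extracting.
    bool encloses(const SourceRange& a, const SourceRange& b) {
      return a.file == b.file && a.beginLine <= b.beginLine && b.endLine <= a.endLine;
    }

    int main() {
      const std::string src = "int x;\nint f() {\n  return x;\n}\n";
      SourceRange fn{"example.cpp", 2, 4};      // the function f, lines 2 to 4
      SourceRange global{"example.cpp", 1, 1};  // the global it uses

      std::printf("%s", extractLines(src, global).c_str());
      std::printf("%s", extractLines(src, fn).c_str());
      std::printf("global inside function? %s\n", encloses(fn, global) ? "yes" : "no");
      return 0;
    }

The encloses-style check at the bottom is exactly what we will need in a second, when the line ranges reported by the preprocessor have to be matched against the line ranges of functions in the AST.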
So we somehow need to map these context-insensitive analysis results we get from our preprocessor hooks onto the context-sensitive information we get from the AST. And the only thing the two have in common is source file locations: the preprocessor knows which source file and line it is currently at while it does its copy-pasting, and the AST can be mapped back to the original source file. So, going to a more realistic example, this is an excerpt from the LULESH code. It is heavily shortened, so there is no way to figure out exactly what it does, but we are mostly interested in the parts colored yellow: it starts with an "if OpenMP is available" preprocessor guard and then includes the actual OpenMP header. The preprocessor gives us all this information: whenever it encounters one of these statements, we get a callback that tells us, we found an OpenMP guard, it goes from line one to line three, and we found an include statement on line two. Then it is on us to figure out that lines one to three fully encompass the statement we found on line two, because, again, we are only doing text-block extraction. That is the matching that happens purely within the preprocessor information. But going on, the preprocessor also tells us that it found OpenMP guards again on lines six to 13, seven to nine, and 15, while we also know from the AST that there is a whole function going from line five to line 20. So we need to marry those two pieces of information together as well, and this matching process was one of the challenges we had to overcome for the kernel extraction.

When we started this whole process, we had a very good idea of how to handle single-translation-unit C code, and we expanded on this to allow for multi-translation-unit C code and C with C++ components, such as new and delete and classes. We are currently working on getting codes that make heavy use of templates to work. The problem with templates is that, if you think back to our analysis step, we only get information about functions and how those functions relate to one another, and templates are, compiler-speaking, not necessarily functions; they are descriptions of how functions will be generated at compile time. So our analysis currently does not give us information about the original template, only about the instantiated templates as generated by the compiler. And if you think back to the global-usage analysis we are doing: if you have complex class inheritance and polymorphism, we are currently not able to traverse all the possible diamond-inheritance hierarchies that are possible in C++. Lastly, the idea is to also allow for automatic checkpointing, so that the wrapper calls needed to set up the environment for the kernel to run can be generated; it is theoretically possible to generate those wrapper calls fully automatically, we just have not looked into it yet. And the one thing we are very skeptical we will ever be able to do is to extract mini-apps from every C++ code ever written, because there are so many things you can do in C++. We can try to get there, but I am very skeptical we ever will. But I do not want to leave you on this somewhat depressing note. Even in a state like this, where the tool cannot fully handle all templates, and where it cannot handle the most complex
inheritance hierarchies, tool-assisted mini-app extraction can still be useful. For example, if you are willing to include the templates manually, because your program will not compile without them, you can just copy-paste them in, and then you can get mini-app extraction to work right now. And if you are interested in doing pinpoint optimizations on your source code, you can extract only those small snippets of code you actually intend to work on, optimize them manually, and then reintegrate them easily. So there are uses even for a tool that is not able to handle every C++ code ever written. And if you know of any HPC code that you think has an identifiable kernel and maybe does not use the deepest inheritance hierarchies, let me know, because I am always interested in figuring out how well my tool performs on other codes. With this, thank you for your attention, I hope it was somewhat interesting, and I am open for questions.

Any questions?

I am the author of a large sparse matrix library, do you have something similar already in your catalog or collection?

So it would be interesting to apply this tool to a library, but usually what we are doing is: we have the whole HPC software, and then you call into the library. Of course your library is probably doing the heavy lifting and therefore probably contains the kernel part of the program, and we can look into extracting that. But extracting a call to a library is, relatively speaking, very easy, so programs whose basic structure is "do some setup, call an HPC library, get the result back" are basically already mini-apps in the sense we are talking about, because they do not do most of the heavy computing themselves. But if your library has internal conditioning or matrix-solving capabilities that you know it struggles with, then we are talking again, so just let me know the name of your library and I will try to look into it.

More questions? Alex was first.

Hi. I was wondering, your mini-apps seem to be focused on compute-intensive parts of the code; do you also construct mini-apps for storage-intensive applications or something else?

So the automatic identification via the PIRA tool tries to figure out what the compute-intensive part of the program is, and this is the only kernel, so to speak, that we are able to identify automatically. But if you as a domain expert know that this is the part of the program where we are IO limited, which is nothing PIRA can identify, and you say, I want to extract starting from this function, then our tool should in theory be able to extract the IO-limited part of your code. So this is the point where you as a domain expert need to specify: this is the part I want to extract, because the only thing that is very prominently identifiable is the compute-limited parts of the program. But yes, in theory it should work.

And a second question, if I may: do you have a library of mini-apps that is ready to be used by third parties?

So currently we are doing our benchmarks on already existing mini-apps, so we are doing mini-app extraction from mini-apps, because I am profiting from the small size of those mini-apps to validate that my program actually runs. For example, the LULESH example I showed is a mini-app in itself.
If I remember correctly, it's a shock simulation in fluids, please don't quote me on that, but it's a great code and it's very easy to work with, so I'm using it for my evaluation. The idea, though, is to get it to work on larger codes; for example, we're currently looking into ISSM, the Ice-sheet and Sea-level System Model, which is an ice-melting simulation for large ice sheets. But yes, we're always looking for other codes, and if you have something that is IO bound then of course tell me. Thank you.

Plenty of time for more questions. Chris?

Sorry to be that guy, but how hard would it be to adapt this approach to Fortran?

Fortran tooling in general is something that has been of interest at our institute for a long time, but the problem with Fortran tooling is that most of our knowledge comes from the Clang front end, so the C-language front end, and I am not sure if the current new Fortran LLVM front end offers the same analysis and query capabilities as Clang. The problem with moving lower in the hierarchy towards LLVM IR, which is more target agnostic, or really more language agnostic, is that as soon as you go down to IR it is very hard to go back and figure out which source code files actually made up this IR. So yes, we already have in the back of our minds that there are other languages used on HPC systems, and usually when I present this approach I get asked, well, we have some Fortran codes, we have some Python codes, how does your approach work there? But we are sadly limited by the Clang front end's capabilities: so C, C++, and Objective-C, which we have never tried, I'm not saying we can do Objective-C, but C and C++ we have tried and are currently limited to, because of our design choice.

There is a Fortran front end for Clang.

Yes, but that is a Fortran front end for the LLVM infrastructure. The Clang front end is the part of the LLVM project which takes C code and translates it to LLVM IR, so I'm not sure we're talking about the same thing. There is a Fortran front end for LLVM, yes, but I would be very surprised if there were a way to translate Fortran code into something that Clang can understand. I might be wrong, there is a myriad of interesting software repositories out there, but currently consider me skeptical.

Then you can use this small part of your code to stress only the floating-point units of your hardware. And this other part is very well vectorized, so we extract that one, and suddenly you are able to use AVX-512 instructions. The idea is to extract every compute-intensive part into its own little package and then benchmark with those separate packages, so in the end you get an idea of the performance of your individual kernel regions and of the whole. That would be the approach I would take. Maybe one more? In the back, okay.

Great talk, thank you so much. A couple of comments and a question. I believe there was a paper about a Fortran mini-app extractor some time ago; I can dig that up and send it to you if I remember, and if not, shoot me a message. Also, the Flang front end in the LLVM project is currently not compatible with Clang, so we do get different ASTs. This approach is actually working at the AST level, so if it were working at the LLVM IR level and then somehow magically mapping back via DWARF, it could work; there was a different project that did that, and it worked kind of okay. And I would have a question about the complex inheritance hierarchies.
Do you have any idea how you could tackle that, how to represent this across the whole program? I mean, have you spent any time looking at that so far, or did you say, okay, that's for future me or future someone to do?

Thank you for the question. It's a mix of both. I spent some time thinking about it and decided it was for future me, because I did not expect it to be very easy. But you already mentioned the general idea: as soon as you go into more complex inheritance chains, you cannot extract everything from one header file per se, so you need to do the same opportunistic extraction idea that we do for functions, but now for classes, structs, and all their possible inheritance parents. This is something we need to analyze at whole-program scale, so it is something that, not in the near future but in the foreseeable future, I intend to add as an analysis pass to the CGCollector, which you are probably familiar with. The idea is that this tool then annotates all that information as metadata, and once we merge it, we hopefully get a very good impression of how those inheritance chains flow through the whole program. So that's the idea. Thanks.

Okay, that's all we have time for. Thanks a lot, Tim.