[00:00.000 --> 00:07.000] Okay, so we're switching topics a bit away from MPI to something at least a little bit [00:12.260 --> 00:15.460] different: link-time call graph analysis. [00:15.460 --> 00:22.460] All right, thank you. So, yeah, we're going to talk about a user-guided program instrumentation [00:23.380 --> 00:29.880] approach, and especially a link-time call graph analysis extension to that project, which [00:29.880 --> 00:36.880] makes it more usable. To give some background on this work, I'm involved in [00:36.960 --> 00:42.920] a project called exaFOAM that deals with the OpenFOAM computational fluid dynamics toolbox. [00:42.920 --> 00:49.000] That's a very complex code, quite large. The goal here is to improve the performance [00:49.000 --> 00:55.680] of OpenFOAM for HPC systems, especially for the exascale era. One of the things [00:55.680 --> 01:02.600] we do is develop empirical performance models, and for that we have developed [01:02.600 --> 01:08.240] a workflow for how we do it. We start with several measurements where we get an initial [01:08.240 --> 01:14.760] overview and identify hotspots. Then, based on these hotspots, we do an analysis of the critical [01:14.760 --> 01:21.760] kernels. And finally, we can do the empirical modeling in order to find scalability bugs [01:21.760 --> 01:28.760] and predict performance when scaling out. Especially for the focused measurements [01:30.240 --> 01:35.280] and the modeling, you need quite accurate and reliable methods to be sure that the data [01:35.280 --> 01:40.880] you collect is right and that you have the right level of detail. What we use for that [01:40.880 --> 01:47.880] is code instrumentation. Just to give the background, I want to give an example. [01:47.880 --> 01:54.120] In GCC and Clang, you have the -finstrument-functions flag, which does a [01:54.120 --> 02:01.120] very basic instrumentation. If you activate that flag, the compiler [02:03.000 --> 02:10.000] inserts function enter and exit probes, which then interface at runtime with a [02:10.160 --> 02:16.880] profiling tool that records runtimes, or more involved metrics, maybe performance [02:16.880 --> 02:23.120] counters, things like that. The big problem with this instrumentation approach, especially [02:23.120 --> 02:30.120] compared to other mechanisms like sampling, is that it can increase the run times by orders [02:30.120 --> 02:35.280] of magnitude if you're not careful and just instrument every function. So, you have [02:35.280 --> 02:42.280] to have some kind of selection mechanism in place to prevent that. The method [02:42.280 --> 02:49.280] that's most commonly used is profile-based filtering, either manual, where you look at [02:50.000 --> 02:56.200] the profile that is generated by the instrumented measurement and then determine which [02:56.200 --> 03:02.280] functions, for example ones that are very small or called very often and don't do much work, [03:02.280 --> 03:08.960] can be filtered out. There are also tools like Score-P, which can help in generating [03:08.960 --> 03:15.960] these filters. However, this works on a purely profile basis, and there is [03:23.240 --> 03:28.440] no calling context involved in the decision. So, there are some other call-graph-based [03:28.440 --> 03:34.440] approaches where you have a static call graph that is then used to make the selection.
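(Editor's note: the following is a minimal sketch of the probe interface behind the -finstrument-functions flag mentioned above. With GCC and Clang, the compiler emits a call to __cyg_profile_func_enter at every function entry and to __cyg_profile_func_exit at every return, and a profiling runtime implements these two hooks. The trivial fprintf "measurement" below is illustrative only.)

```cpp
#include <cstdio>

extern "C" {

// The compiler must not instrument the probes themselves,
// otherwise they would recurse endlessly.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
  // A real profiling tool would resolve `fn` to a symbol name and
  // start a timer or read hardware performance counters here.
  std::fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
  // ...and stop the timer / record the measurement for `fn` here.
  std::fprintf(stderr, "exit  %p\n", fn);
}

} // extern "C"
```

Linking such a runtime into an application built with -finstrument-functions also illustrates why unselective instrumentation is so expensive: every single function call now pays for two extra calls plus whatever work the probes do.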
[03:34.440 --> 03:41.440] I want to highlight two. There's PIRA, which does automatic iterative refinement. It [03:42.640 --> 03:49.640] starts with a static selection based on the statically collected call graph and then [03:50.040 --> 03:56.440] iteratively compiles the program with instrumentation, executes it, measures it, and then looks at [03:56.440 --> 04:03.200] the profile in order to determine what to filter out. And the tool we're working [04:03.200 --> 04:09.560] on is the CaPI tool, which is short for Compiler-Assisted Performance Instrumentation, and that one [04:09.560 --> 04:16.560] is more focused on allowing users to specify what they want to measure, so they can say, okay, [04:16.960 --> 04:22.080] I'm interested in MPI communication, so that is what I want to measure. [04:22.080 --> 04:29.080] Here's a very high-level overview of how CaPI works. You have your source code, [04:29.080 --> 04:36.080] and that is put into the static analysis in order to generate the call graph. [04:36.840 --> 04:43.340] At the same time, the user specifies the measurement objective in the form of a selection spec; we [04:43.340 --> 04:50.340] have a simple DSL for that. These are then fed together into the CaPI tool in order to [04:50.560 --> 04:57.560] generate the instrumentation configuration, which is hopefully low overhead. [04:57.560 --> 05:04.560] Let's consider an example. We might have the following scenario: [05:04.600 --> 05:10.560] I want to record all call paths that contain MPI communication. Additionally, I want to [05:10.560 --> 05:15.400] measure functions that contain loops with at least 10 floating-point operations, and [05:15.400 --> 05:21.920] I don't care about system headers or inlined functions. And this is how that looks as a [05:21.920 --> 05:28.920] CaPI spec. I won't get into the details of the syntax, but you can combine different [05:31.040 --> 05:38.040] selection modules, where you start with the set of all functions and then, for example, [05:38.800 --> 05:45.000] select inlined functions or system header functions, and then combine them to, in the [05:45.000 --> 05:52.000] end, produce one set that you want to instrument. For this particular example, this reduces [05:53.880 --> 06:00.880] the number of instrumented functions by 74 percent for our OpenFOAM test case. [06:01.720 --> 06:07.280] So, the call graph is the base data structure that we use for all the analysis [06:07.280 --> 06:14.200] and for the selection. It is currently generated at the source level by a tool [06:14.200 --> 06:20.440] called MetaCG. That can be a bit cumbersome for very complex applications such as [06:20.440 --> 06:26.920] OpenFOAM, because you need a separate source code analysis step, and [06:26.920 --> 06:31.400] that can be difficult to set up, especially if there are shared library dependencies, [06:31.400 --> 06:37.760] things like that. So, it can be tricky. So, we looked into how we can generate [06:37.760 --> 06:44.400] the call graph at different stages. This is what Tim is going to talk about: how [06:44.400 --> 06:50.560] we can generate the call graph at different program stages. We introduced the [06:50.560 --> 06:57.560] CaGe compiler plugin, where we evaluate how this can be done at link time. [06:57.560 --> 07:04.560] So, well, thank you. So, we were interested in whole-program call graphs.
This is [07:10.680 --> 07:15.720] because CaPI, the tool that does our instrumentation, needs to be aware of every function inside [07:15.720 --> 07:21.080] our call graph, which means that it's necessary to have a whole-program view. [07:21.080 --> 07:27.480] The tools that were used, like MetaCG, were able to create a call graph, and they [07:27.480 --> 07:31.840] have the distinct advantage of being able to attach metadata to this call graph; this [07:31.840 --> 07:38.000] is where the name MetaCG came from. This means that we can not only answer structural [07:38.000 --> 07:43.120] questions, like instrumenting every function that eventually has a call path down [07:43.120 --> 07:49.960] to some MPI call, but can also instrument based on metadata, for example the loop depth [07:49.960 --> 07:56.600] specification or floating-point operations. And yes, there are multiple possible stages [07:56.600 --> 08:01.320] at which to generate a call graph. One of them is source code. Another is the intermediate representation [08:01.320 --> 08:08.040] that is basically part of every compiler, especially the LLVM compile pipeline. And [08:08.040 --> 08:14.760] machine code is the very last. So, the basic idea is: well, we have the source code, let's [08:14.760 --> 08:20.200] do call graph generation on the source code, which is relatively easy. MetaCG does [08:20.200 --> 08:25.160] it on a source code basis. But because you feed every single source code file [08:25.160 --> 08:30.000] into MetaCG, your view of the program is limited to one translation unit [08:30.000 --> 08:36.880] at a time. You then need to merge the partial call graphs of every translation unit [08:36.880 --> 08:42.960] together to generate this overview of the whole program that you need. The information [08:42.960 --> 08:46.680] you then gather from your profiling maps very cleanly back to your source code, because [08:46.680 --> 08:52.480] once you find a function, well, it's named the same way in your source code. But on the [08:52.480 --> 08:57.160] other hand, what you write in your source code is not necessarily what's actually executed [08:57.160 --> 09:01.360] on the machine, right? Because there might be code transformations: constant propagation, [09:01.360 --> 09:05.800] dead code elimination, inlining. And this is actually something we want to be aware [09:05.800 --> 09:09.400] of if we are doing instrumentation; we don't want to instrument a function that doesn't [09:09.400 --> 09:16.560] exist anymore. Also, this merging of the translation units means that the user is involved. The [09:16.560 --> 09:22.400] user currently has to specify, when using MetaCG, which source code translation units [09:22.400 --> 09:28.520] belong to one target, and then has to manually tell MetaCG that these functions are all to [09:28.520 --> 09:34.040] be merged: please generate the call graph for that. And the user might not perfectly emulate [09:34.040 --> 09:40.240] the linker's behavior. There are different resolution strategies that a linker might choose [09:40.240 --> 09:47.080] if there are identically named structs or classes, and, depending on how you implement [09:47.080 --> 09:51.680] your merging process, you might have slight differences between the call graph that you [09:51.680 --> 09:57.200] generated and what the linker will eventually do. So, the other extreme would [09:57.200 --> 10:01.320] be: well, let's do it on the compiled machine code then.
Reverse engineering tools like [10:01.320 --> 10:06.360] Radare2 or Ghidra are able to generate call graphs from binary data just fine. Those [10:06.360 --> 10:10.240] have the very distinct advantage that this is actually what is run on the machine: there [10:10.240 --> 10:15.200] are no code transformations left. You have the advantage of being able to see machine [10:15.200 --> 10:23.000] code optimization passes if they influence the generated call graph. But, on the other [10:23.000 --> 10:27.720] hand, a lot of information, the metadata that we would also like to be able to instrument [10:27.720 --> 10:32.480] based upon, is lost as soon as we go down to machine code. Inlining has already happened, [10:32.480 --> 10:38.360] so there is no function annotated with "please inline" anymore. Also, pointer type information, [10:38.360 --> 10:42.240] as we heard in the talk earlier, gets lost as soon as we go down to machine code. And [10:42.240 --> 10:47.360] constness is also something that has to be inferred rather than being actually stated once we [10:47.360 --> 10:54.000] go down to machine code. So, we decided the best of both worlds is probably the LLVM IR, [10:54.000 --> 11:00.240] because it's a heavily annotated representation. It is close enough to what will run on the [11:00.240 --> 11:07.520] machine that we have the ability to observe the code transformations. We are able to give [11:07.520 --> 11:12.720] more specific estimates of what the actual cost of a function might be, because we have [11:12.720 --> 11:19.240] a clearer way of tracking, for example, instruction counts, floating-point ops, and integer ops. On [11:19.240 --> 11:23.800] the other hand, it's also close enough to what the user actually wrote, because we're [11:23.800 --> 11:28.360] not down at the machine code yet. We can figure out the inlining, the constness, [11:28.360 --> 11:35.240] the virtual functions; we can get type information in the IR. And if we do it at link time, we [11:35.240 --> 11:40.720] are not even limited to the translation-unit-by-translation-unit scope that source-code-based [11:40.720 --> 11:45.560] approaches are. So, in your pretty standard compile pipeline, you have [11:45.560 --> 11:49.400] your source code, which forms a translation unit and gets fed into the compiler, which outputs [11:49.400 --> 11:54.640] intermediate representation; then the optimizer runs on it, and the optimized [11:54.640 --> 12:00.720] IR modules of multiple translation units are fed into the linker. We can do our call [12:00.720 --> 12:05.960] graph analysis inside the linker, solve our translation unit problem, and have all [12:05.960 --> 12:12.160] our information ready. To do this, we developed the CaGe plugin. CaGe stands [12:12.160 --> 12:19.720] for Call Graph Embedding LLVM plugin. It basically generates a call graph using [12:19.720 --> 12:26.120] some of LLVM's utilities, does some annotation and virtual function call analysis, and it can [12:26.120 --> 12:32.240] run as part of the optimizer pipeline, but it can also run as part of the LLVM linker, [12:32.240 --> 12:38.840] also as a plugin, for which we use a slightly modified version of the LLVM linker; [12:38.840 --> 12:45.400] the basic logic for running plugins in the LLVM linker was already there. Then we do a [12:45.400 --> 12:49.960] vtable analysis and our metadata annotation, because it's all available to us.
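(Editor's note: to make the IR-level analysis more concrete, here is a heavily simplified sketch of an LLVM new-pass-manager module pass that records direct call edges and a crude floating-point instruction count per function. It is an illustration under assumed names and output format, not CaGe's actual implementation; indirect and virtual calls would additionally require the vtable analysis mentioned in the talk.)

```cpp
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Passes/PassPlugin.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
struct CallGraphSketchPass : PassInfoMixin<CallGraphSketchPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    for (Function &F : M) {
      if (F.isDeclaration())
        continue;
      unsigned FloatOps = 0;
      for (Instruction &I : instructions(F)) {
        // Record a direct call edge; indirect/virtual calls would need
        // extra analysis to narrow down the possible callees.
        if (auto *CB = dyn_cast<CallBase>(&I))
          if (Function *Callee = CB->getCalledFunction())
            errs() << F.getName() << " -> " << Callee->getName() << "\n";
        // Crude metadata: count floating-point arithmetic instructions.
        if (I.isBinaryOp() && I.getType()->isFloatingPointTy())
          ++FloatOps;
      }
      errs() << F.getName() << ": ~" << FloatOps << " float ops\n";
    }
    return PreservedAnalyses::all();
  }
};
} // namespace

// Registration so the pass can be loaded as a plugin, e.g. at compile time
// via clang's -fpass-plugin=, or (as described in the talk) at link time
// with a linker that can load new-pass-manager plugins.
extern "C" PassPluginLibraryInfo llvmGetPassPluginInfo() {
  return {LLVM_PLUGIN_API_VERSION, "cg-sketch", "0.1", [](PassBuilder &PB) {
            PB.registerPipelineStartEPCallback(
                [](ModulePassManager &MPM, OptimizationLevel) {
                  MPM.addPass(CallGraphSketchPass());
                });
          }};
}
```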
And then [12:49.960 --> 12:55.200] we embed the result into the binary, which enables dynamic augmentation, and I will come [12:55.200 --> 13:00.960] to that one later. So, at link time, we're basically doing static analysis. As I [13:00.960 --> 13:05.960] already mentioned, we split our information into basically two types. One of them is structural [13:05.960 --> 13:11.800] information like call hierarchies, call paths, call depth, the number of children, how deep [13:11.800 --> 13:18.480] we are in our path. You also have virtual function calls, which are mostly structural, [13:18.480 --> 13:22.680] because once you have virtual and polymorphic calls, you have a set of functions that [13:22.680 --> 13:28.520] are probably being called through that pointer. So, we can narrow down the possibilities [13:28.520 --> 13:33.240] of which functions are called, but we cannot actually statically figure out which function [13:33.240 --> 13:38.600] is getting called. So, it's slightly metadata-based, but it's also mostly structural. On [13:38.600 --> 13:42.960] the metadata side, we have instruction composition, so we can determine the [13:42.960 --> 13:47.720] relation between arithmetic operations and memory operations, for example, or we can [13:47.720 --> 13:54.880] generate local and global loop depth estimates. And then we have inlining information, which [13:54.880 --> 14:00.960] is metadata, because inlining is not a must-do for a compiler; just because you [14:00.960 --> 14:05.200] have marked a function for inlining doesn't mean that the compiler will actually inline [14:05.200 --> 14:11.160] it. So, it's partly structural information and partly metadata, so you see [14:11.160 --> 14:16.960] there's no clear line between those; they blur at some points, but we can represent [14:16.960 --> 14:23.520] all of them in the metadata-annotated call graph. And, as I said, we are able to do dynamic [14:23.520 --> 14:29.000] augmentation. Well, what is dynamic augmentation? If you remember, each object contains the [14:29.000 --> 14:35.720] call graph that we generated, which means that the call graphs can be aggregated at runtime [14:35.720 --> 14:40.360] when a shared library is loaded, because even if you are at link time, even if you can see [14:40.360 --> 14:45.160] all the statically linked objects, all the translation units that belong to your target, [14:45.160 --> 14:51.800] your binary might load a shared library, which, well, you're unaware of. So, the idea [14:51.800 --> 14:57.840] is: as soon as the main executable is loaded, it passes its embedded call graph on startup [14:57.840 --> 15:03.120] to a runtime collecting library. Then, the main executable can load whatever shared [15:03.120 --> 15:10.320] object it wants. If this shared object also contains a call graph, then this runtime [15:10.320 --> 15:15.480] collector gets passed that call graph on the first load of the shared object and can aggregate [15:15.480 --> 15:20.200] it by merging. So, we're basically back to the translation-unit-by-translation-unit [15:20.200 --> 15:27.080] based approach, but now we're doing the merging at the level of shared libraries, binaries, and executables. [15:27.080 --> 15:31.880] Then, we attach all this data together into one really big whole-program call graph, [15:31.880 --> 15:36.440] now including shared objects.
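(Editor's note: a small sketch of the dynamic augmentation idea described above. All names here are assumptions for illustration and not the actual CaGe/CaPI runtime interface: the serialized call graph is embedded in a dedicated section of each binary or shared object, and a constructor function hands it to a runtime collector when the object is loaded, whether at program startup or via dlopen, so the collector can merge the pieces into one whole-program graph.)

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical runtime-collector entry point. In the real setup this would
// live in the measurement runtime and merge the passed fragment into a
// global whole-program call graph; here it only prints the size.
extern "C" void cg_runtime_register(const char *serialized_cg, std::size_t size) {
  std::fprintf(stderr, "registering call graph fragment (%zu bytes)\n", size);
  (void)serialized_cg;
}

// Call graph embedded at (link) time, e.g. placed into a custom ELF section.
__attribute__((section(".cg_embedded")))
static const char embedded_call_graph[] = "{ \"functions\": [] }";

// Runs automatically when this executable or shared object is loaded
// (before main(), or on dlopen()) and hands its graph to the collector.
__attribute__((constructor))
static void register_embedded_call_graph() {
  cg_runtime_register(embedded_call_graph, sizeof(embedded_call_graph));
}
```

Because the constructor runs on first load, libraries that are never loaded never contribute, and ones loaded later simply extend the merged graph.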
And then we can export this, for example, and pass it [15:36.440 --> 15:39.960] back to CaPI for further refinement of the instrumentation. [15:39.960 --> 15:54.880] Go ahead. All right. Thanks. So, to put it all together, for CaPI we have the call [15:54.880 --> 16:00.040] graph analysis approach Tim just explained. For each object file, we have the call [16:00.040 --> 16:06.800] graph analysis and then the embedding. Then, on the runtime side, we can [16:06.800 --> 16:13.040] merge the main executable's call graph with those of the shared libraries as they are loaded. And [16:13.040 --> 16:19.320] we defer the instrumentation, using a dynamic instrumentation approach, in order to [16:19.320 --> 16:30.120] apply that selection dynamically. And, yeah, that's how it works. So, to summarize, [16:30.120 --> 16:37.520] we are developing the CaPI tool for instrumentation selection based on call graph analysis. [16:37.520 --> 16:44.560] We have explored the CaGe plugin for generating this call graph information at link time, [16:44.560 --> 16:50.440] which allows whole-program visibility and this dynamic augmentation. Together [16:50.440 --> 16:57.120] with CaPI, we can use the embedded call graph to run the selection at runtime [16:57.120 --> 17:07.760] and thereby improve CaPI by making the compilation and analysis process more streamlined. [17:07.760 --> 17:13.640] This is under active development, so at this point we don't have a very detailed [17:13.640 --> 17:20.000] evaluation of performance and things like that. There are some concerns; for example, [17:20.000 --> 17:27.240] when you go to very big programs and do LTO, there might be performance problems. So, there [17:27.240 --> 17:37.720] might be more work needed to make it viable in that regard. But, yeah, it works well in a prototype [17:37.720 --> 17:44.720] fashion. Yeah. Thank you. [17:44.720 --> 17:53.240] Any questions? [17:53.240 --> 18:07.320] Perhaps I was a bit distracted. So, I have two questions. If you can comment on lambda [18:07.320 --> 18:19.160] functions, on OpenMP sections of the code, not specific constructs, and on the instruction [18:19.160 --> 18:27.360] cache. If you can comment on how those are handled, those three aspects, let's say. [18:27.360 --> 18:34.280] So, when it comes to OpenMP, we don't at the moment have any specific OpenMP profiling [18:34.280 --> 18:41.440] interface that we target. This might be something that is probably useful in the future. [18:41.440 --> 18:49.360] But for now, we just do the basic function instrumentation and select on that. Yeah. Then the [18:49.360 --> 18:55.720] other point was, sorry, could you repeat? [18:55.720 --> 18:59.480] So, do you want to comment on that? [18:59.480 --> 19:05.840] So, lambdas and caches: regarding caching, no, there is no logic to handle caching in any way. [19:05.840 --> 19:11.200] And regarding lambdas and OpenMP, if you're talking about profiling, you've got [19:11.200 --> 19:16.120] your answer right there.
But if you're talking about generating call graphs in which lambdas [19:16.120 --> 19:23.680] and OpenMP runtime calls are visible, then the call graph will actually figure out that [19:23.680 --> 19:32.800] there are OpenMP runtime calls and, if I remember correctly, will figure out that [19:32.800 --> 19:38.400] a given function calls back into the OpenMP runtime, because once you are in the IR, the runtime calls are actually [19:38.400 --> 19:43.800] spelled out; all the pragmas have been removed from the actual source code. So, we are aware [19:43.800 --> 19:48.680] of OpenMP, but we are not currently using that information for any profiling. But you could [19:48.680 --> 19:55.320] do metadata-based CaPI selection with it. So, every call path that eventually leads [19:55.320 --> 20:03.080] to an OpenMP runtime call would be a valid instrumentation target using CaPI. [20:03.080 --> 20:08.360] The question about the instruction cache is whether, and this is my ignorance, [20:08.360 --> 20:14.640] if you are introducing a lot of new instructions here into the code that is being [20:14.640 --> 20:22.280] run, I was wondering whether too much data ends up there; maybe I didn't understand well. [20:22.280 --> 20:27.520] So, this is more of a performance-related question, right? Okay, so, yes, of course, [20:27.520 --> 20:31.840] you introduce new instructions whenever a shared library is loaded, because we then [20:31.840 --> 20:37.840] add instructions that pass the call graph back to our runtime collecting facility, and [20:37.840 --> 20:42.920] we also introduce instructions because we are using a profiling approach, which means function [20:42.920 --> 20:48.200] calls. So, yes, we are interfering with the instruction fetching and caching flow, which [20:48.200 --> 20:54.000] is also why profiling has rather high overhead compared to sampling approaches, for example. [20:54.000 --> 21:00.840] But as Sebastian said, we have not really extensively profiled our own application, which is quite [21:00.840 --> 21:13.960] ironic, so we are not aware of how big the benefit or the impact actually would be. [21:13.960 --> 21:20.280] So, on your slide you say you have a fork of LLD. The obvious question is: what's stopping [21:20.280 --> 21:26.280] you from upstreaming this? So, yes, we have a fork of LLD that basically just [21:26.280 --> 21:34.080] exposes a flag to load a new-pass-manager plugin; you pass the plugin, and it does the rest. [21:34.080 --> 21:39.280] What's currently holding us back from upstreaming is that it's not very well developed, it was cobbled [21:39.280 --> 21:44.720] together in half a week or so, and there already is an open merge request on [21:44.720 --> 21:52.320] Phabricator that implements this exact functionality for, if I remember correctly, LLVM 9, which [21:52.320 --> 22:03.000] sat untouched for a year until it was totally abandoned and closed, and so we didn't actually [22:03.000 --> 22:09.280] figure out how to make this more interesting to the community. Don't take that as a signal. People just move [22:09.280 --> 22:14.440] jobs or whatever, so try it again and push people, and I can help you with that as well, [22:14.440 --> 22:18.280] finding the right people to get this in, because it seems like a simple and obvious thing to [22:18.280 --> 22:24.640] have. It isn't actually that hard. Apparently, there wasn't much interest in the community [22:24.640 --> 22:37.000] back in 2020? Yes, that's too long ago.
We will polish it a little more, and then hopefully [22:37.000 --> 22:48.760] get this one upstream. Thanks. Any other questions? No. Okay. Thank you very much. [22:48.760 --> 23:07.720] Thank you very much, Sebastian.