[00:00.000 --> 00:07.000] Okay, so we're switching topics a bit away from MPI to something at least a little bit [00:12.260 --> 00:15.460] different: link-time call graph analysis. [00:15.460 --> 00:22.460] All right, thank you. So, yeah, we're going to talk about a user-guided program instrumentation [00:23.380 --> 00:29.880] approach, and especially a link-time call graph analysis extension to that project, which [00:29.880 --> 00:36.880] makes it more usable. To give some background on this work, I'm involved in [00:36.960 --> 00:42.920] a project called exaFOAM that deals with the OpenFOAM computational fluid dynamics toolbox. [00:42.920 --> 00:49.000] That's a very complex code, quite large. The goal here is to improve the performance [00:49.000 --> 00:55.680] of OpenFOAM for HPC systems, especially for the exascale era. One of the things [00:55.680 --> 01:02.600] we do is develop empirical performance models, and for that we have developed [01:02.600 --> 01:08.240] a workflow for how we do it. We start with several measurements where we get an initial [01:08.240 --> 01:14.760] overview and identify hotspots. Then, based on these hotspots, we do an analysis of the critical [01:14.760 --> 01:21.760] kernels. And finally, we can do the empirical modeling in order to find scalability bugs [01:21.760 --> 01:28.760] and predict performance when scaling out. Especially for the focused measurements [01:30.240 --> 01:35.280] and the modeling, you need quite accurate and reliable methods to be sure that the data [01:35.280 --> 01:40.880] you collect is right and that you have the right level of detail. What we use for that [01:40.880 --> 01:47.880] is code instrumentation. Just to give the background, I want to give an example. [01:47.880 --> 01:54.120] In GCC and Clang, you have the -finstrument-functions flag, which does a [01:54.120 --> 02:01.120] very basic instrumentation. If you activate that flag, the compiler [02:03.000 --> 02:10.000] inserts function enter and exit probes, which then interface at runtime with a [02:10.160 --> 02:16.880] profiling tool that records runtimes, or more involved metrics, maybe performance [02:16.880 --> 02:23.120] counters, things like that. The big problem with this instrumentation approach, especially [02:23.120 --> 02:30.120] compared to other mechanisms like sampling, is that it can increase the run times by orders [02:30.120 --> 02:35.280] of magnitude if you're not careful and just instrument every function. So, you have [02:35.280 --> 02:42.280] to have some kind of selection mechanism in place to prevent that. The method [02:42.280 --> 02:49.280] that's most commonly used is profile-based filtering, either manual, where you look at [02:50.000 --> 02:56.200] the profile that is generated by the instrumented measurement and then determine which [02:56.200 --> 03:02.280] functions, for example ones that are very small or called very often and don't do much work, [03:02.280 --> 03:08.960] can be filtered out. There are also tools like Score-P, which can help in generating [03:08.960 --> 03:15.960] these filters. However, this works on a purely profile basis, and there is [03:23.240 --> 03:28.440] no calling context involved in the decision. So, there are some other call-graph-based [03:28.440 --> 03:34.440] approaches where you have a static call graph that is then used to make the selection.
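(Editor's note: the following is a minimal sketch of the probe interface behind the -finstrument-functions flag mentioned above. With GCC and Clang, the compiler emits a call to __cyg_profile_func_enter at every function entry and to __cyg_profile_func_exit at every return, and a profiling runtime implements these two hooks. The trivial fprintf "measurement" below is illustrative only.)

```cpp
#include <cstdio>

extern "C" {

// The compiler must not instrument the probes themselves,
// otherwise they would recurse endlessly.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
  // A real profiling tool would resolve `fn` to a symbol name and
  // start a timer or read hardware performance counters here.
  std::fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
  // ...and stop the timer / record the measurement for `fn` here.
  std::fprintf(stderr, "exit  %p\n", fn);
}

} // extern "C"
```

Linking such a runtime into an application built with -finstrument-functions also illustrates why unselective instrumentation is so expensive: every single function call now pays for two extra calls plus whatever work the probes do.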
[03:34.440 --> 03:41.440] I want to highlight two. There's PIRA, which does automatic iterative refinement. It [03:42.640 --> 03:49.640] starts with a static selection based on the statically collected call graph and then [03:50.040 --> 03:56.440] iteratively compiles the program with instrumentation, executes it, measures it, and then looks at [03:56.440 --> 04:03.200] the profile in order to determine what to filter out. And the tool we're working [04:03.200 --> 04:09.560] on is the CaPI tool, which is short for Compiler-Assisted Performance Instrumentation, and that one [04:09.560 --> 04:16.560] is more focused on allowing users to specify what they want to measure, so they can say, okay, [04:16.960 --> 04:22.080] I'm interested in MPI communication, so that is what I want to measure. [04:22.080 --> 04:29.080] Here's a very high-level overview of how CaPI works. You have your source code, [04:29.080 --> 04:36.080] and that is put into the static analysis in order to generate the call graph. [04:36.840 --> 04:43.340] At the same time, the user specifies the measurement objective in the form of a selection spec; we [04:43.340 --> 04:50.340] have a simple DSL for that. These are then fed together into the CaPI tool in order to [04:50.560 --> 04:57.560] generate the instrumentation configuration, which is hopefully low overhead. [04:57.560 --> 05:04.560] Let's consider an example. We might have the following scenario: [05:04.600 --> 05:10.560] I want to record all call paths that contain MPI communication. Additionally, I want to [05:10.560 --> 05:15.400] measure functions that contain loops with at least 10 floating-point operations, and [05:15.400 --> 05:21.920] I don't care about system headers or inlined functions. And this is how that looks as a [05:21.920 --> 05:28.920] CaPI spec. I won't get into the details of the syntax, but you can combine different [05:31.040 --> 05:38.040] selection modules, where you start with the set of all functions and then, for example, [05:38.800 --> 05:45.000] select inlined functions or system header functions, and then combine them to, in the [05:45.000 --> 05:52.000] end, produce one set that you want to instrument. For this particular example, this reduces [05:53.880 --> 06:00.880] the number of instrumented functions by 74 percent for our OpenFOAM test case. [06:01.720 --> 06:07.280] So, the call graph is the base data structure that we use for all the analysis [06:07.280 --> 06:14.200] and for the selection. It is currently generated at the source level by a tool [06:14.200 --> 06:20.440] called MetaCG. That can be a bit cumbersome for very complex applications such as [06:20.440 --> 06:26.920] OpenFOAM, because you need a separate source code analysis step, and [06:26.920 --> 06:31.400] that can be difficult to set up, especially if there are shared library dependencies, [06:31.400 --> 06:37.760] things like that. So, it can be tricky. So, we looked into how we can generate [06:37.760 --> 06:44.400] the call graph at different stages. This is what Tim is going to talk about: how [06:44.400 --> 06:50.560] we can generate the call graph at different program stages. We introduced the [06:50.560 --> 06:57.560] CaGe compiler plugin, where we evaluate how this can be done at link time. [06:57.560 --> 07:04.560] So, well, thank you. So, we were interested in whole-program call graphs.
This is [07:10.680 --> 07:15.720] because CaPI, the tool that does our instrumentation, needs to be aware of every function inside [07:15.720 --> 07:21.080] our call graph, which means that it's necessary to have a whole-program view. [07:21.080 --> 07:27.480] The tools that were used, like MetaCG, were able to create a call graph, and they [07:27.480 --> 07:31.840] have the distinct advantage of being able to attach metadata to this call graph; this [07:31.840 --> 07:38.000] is where the name MetaCG came from. This means that we can not only answer structural [07:38.000 --> 07:43.120] questions, like instrumenting every function that eventually has a call path down [07:43.120 --> 07:49.960] to some MPI call, but can also instrument based on metadata, for example the loop depth [07:49.960 --> 07:56.600] specification or floating-point operations. And yes, there are multiple possible stages [07:56.600 --> 08:01.320] at which to generate a call graph. One of them is source code. Another is the intermediate representation [08:01.320 --> 08:08.040] that is basically part of every compiler, especially the LLVM compile pipeline. And [08:08.040 --> 08:14.760] machine code is the very last. So, the basic idea is: well, we have the source code, let's [08:14.760 --> 08:20.200] do call graph generation on the source code, which is relatively easy. MetaCG does [08:20.200 --> 08:25.160] it on a source code basis. But because you feed every single source code file [08:25.160 --> 08:30.000] into MetaCG, your view of the program is limited to one translation unit [08:30.000 --> 08:36.880] at a time. You then need to merge the partial call graphs of every translation unit [08:36.880 --> 08:42.960] together to generate this overview of the whole program that you need. The information [08:42.960 --> 08:46.680] you then gather from your profiling maps very cleanly back to your source code, because [08:46.680 --> 08:52.480] once you find a function, well, it's named the same way in your source code. But on the [08:52.480 --> 08:57.160] other hand, what you write in your source code is not necessarily what's actually executed [08:57.160 --> 09:01.360] on the machine, right? Because there might be code transformations: constant propagation, [09:01.360 --> 09:05.800] dead code elimination, inlining. And this is actually something we want to be aware [09:05.800 --> 09:09.400] of if we are doing instrumentation; we don't want to instrument a function that doesn't [09:09.400 --> 09:16.560] exist anymore. Also, this merging of the translation units means that the user is involved. The [09:16.560 --> 09:22.400] user currently has to specify, when using MetaCG, which source code translation units [09:22.400 --> 09:28.520] belong to one target, and then has to manually tell MetaCG that these functions are all to [09:28.520 --> 09:34.040] be merged: please generate the call graph for that. And the user might not perfectly emulate [09:34.040 --> 09:40.240] the linker's behavior. There are different resolution strategies that a linker might choose [09:40.240 --> 09:47.080] if there are identically named structs or classes, and, depending on how you implement [09:47.080 --> 09:51.680] your merging process, you might have slight differences between the call graph that you [09:51.680 --> 09:57.200] generated and what the linker will eventually do. So, the other extreme would [09:57.200 --> 10:01.320] be: well, let's do it on the compiled machine code then.
Reverse engineering tools like [10:01.320 --> 10:06.360] Radare2 or Ghidra are able to generate call graphs from binary data just fine. Those [10:06.360 --> 10:10.240] have the very distinct advantage that this is actually what is run on the machine: there [10:10.240 --> 10:15.200] are no code transformations left. You have the advantage of being able to see machine [10:15.200 --> 10:23.000] code optimization passes if they influence the generated call graph. But, on the other [10:23.000 --> 10:27.720] hand, a lot of information, the metadata that we would also like to be able to instrument [10:27.720 --> 10:32.480] based upon, is lost as soon as we go down to machine code. Inlining has already happened, [10:32.480 --> 10:38.360] so there is no function annotated with "please inline" anymore. Also, pointer type information, [10:38.360 --> 10:42.240] as we heard in the talk earlier, gets lost as soon as we go down to machine code. And [10:42.240 --> 10:47.360] constness is also something that has to be inferred rather than being actually stated once we [10:47.360 --> 10:54.000] go down to machine code. So, we decided the best of both worlds is probably the LLVM IR, [10:54.000 --> 11:00.240] because it's a heavily annotated representation. It is close enough to what will run on the [11:00.240 --> 11:07.520] machine that we have the ability to observe the code transformations. We are able to give [11:07.520 --> 11:12.720] more specific estimates of what the actual cost of a function might be, because we have [11:12.720 --> 11:19.240] a clearer way of tracking, for example, instruction counts, floating-point ops, and integer ops. On [11:19.240 --> 11:23.800] the other hand, it's also close enough to what the user actually wrote, because we're [11:23.800 --> 11:28.360] not down at the machine code yet. We can figure out the inlining, the constness, [11:28.360 --> 11:35.240] the virtual functions; we can get type information in the IR. And if we do it at link time, we [11:35.240 --> 11:40.720] are not even limited to the translation-unit-by-translation-unit scope that source-code-based [11:40.720 --> 11:45.560] approaches are. So, in your pretty standard compile pipeline, you have [11:45.560 --> 11:49.400] your source code, which forms a translation unit and gets fed into the compiler, which outputs [11:49.400 --> 11:54.640] intermediate representation; then the optimizer runs on it, and the optimized [11:54.640 --> 12:00.720] IR modules of multiple translation units are fed into the linker. We can do our call [12:00.720 --> 12:05.960] graph analysis inside the linker, solve our translation unit problem, and have all [12:05.960 --> 12:12.160] our information ready. To do this, we developed the CaGe plugin. CaGe stands [12:12.160 --> 12:19.720] for Call Graph Embedding LLVM plugin. It basically generates a call graph using [12:19.720 --> 12:26.120] some of LLVM's utilities, does some annotation and virtual function call analysis, and it can [12:26.120 --> 12:32.240] run as part of the optimizer pipeline, but it can also run as part of the LLVM linker, [12:32.240 --> 12:38.840] also as a plugin, for which we use a slightly modified version of the LLVM linker; [12:38.840 --> 12:45.400] the basic logic for running plugins in the LLVM linker was already there. Then we do a [12:45.400 --> 12:49.960] vtable analysis and our metadata annotation, because it's all available to us.
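(Editor's note: to make the IR-level analysis more concrete, here is a heavily simplified sketch of an LLVM new-pass-manager module pass that records direct call edges and a crude floating-point instruction count per function. It is an illustration under assumed names and output format, not CaGe's actual implementation; indirect and virtual calls would additionally require the vtable analysis mentioned in the talk.)

```cpp
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Passes/PassPlugin.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
struct CallGraphSketchPass : PassInfoMixin<CallGraphSketchPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    for (Function &F : M) {
      if (F.isDeclaration())
        continue;
      unsigned FloatOps = 0;
      for (Instruction &I : instructions(F)) {
        // Record a direct call edge; indirect/virtual calls would need
        // extra analysis to narrow down the possible callees.
        if (auto *CB = dyn_cast<CallBase>(&I))
          if (Function *Callee = CB->getCalledFunction())
            errs() << F.getName() << " -> " << Callee->getName() << "\n";
        // Crude metadata: count floating-point arithmetic instructions.
        if (I.isBinaryOp() && I.getType()->isFloatingPointTy())
          ++FloatOps;
      }
      errs() << F.getName() << ": ~" << FloatOps << " float ops\n";
    }
    return PreservedAnalyses::all();
  }
};
} // namespace

// Registration so the pass can be loaded as a plugin, e.g. at compile time
// via clang's -fpass-plugin=, or (as described in the talk) at link time
// with a linker that can load new-pass-manager plugins.
extern "C" PassPluginLibraryInfo llvmGetPassPluginInfo() {
  return {LLVM_PLUGIN_API_VERSION, "cg-sketch", "0.1", [](PassBuilder &PB) {
            PB.registerPipelineStartEPCallback(
                [](ModulePassManager &MPM, OptimizationLevel) {
                  MPM.addPass(CallGraphSketchPass());
                });
          }};
}
```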
And then [12:49.960 --> 12:55.200] we embed the result into the binary, which enables dynamic augmentation, and I will come [12:55.200 --> 13:00.960] to that one later. So, at link time, we're basically doing static analysis. As I [13:00.960 --> 13:05.960] already mentioned, we split our information into basically two types. One of them is structural [13:05.960 --> 13:11.800] information like call hierarchies, call paths, call depth, the number of children, how deep [13:11.800 --> 13:18.480] we are in our path. You also have virtual function calls, which are mostly structural, [13:18.480 --> 13:22.680] because once you have virtual and polymorphic calls, you have a set of functions that [13:22.680 --> 13:28.520] are probably being called through that pointer. So, we can narrow down the possibilities [13:28.520 --> 13:33.240] of which functions are called, but we cannot actually statically figure out which function [13:33.240 --> 13:38.600] is getting called. So, it's slightly metadata-based, but it's also mostly structural. On [13:38.600 --> 13:42.960] the metadata side, we have instruction composition, so we can determine the [13:42.960 --> 13:47.720] relation between arithmetic operations and memory operations, for example, or we can [13:47.720 --> 13:54.880] generate local and global loop depth estimates. And then we have inlining information, which [13:54.880 --> 14:00.960] is metadata, because inlining is not a must-do for a compiler; just because you [14:00.960 --> 14:05.200] have marked a function for inlining doesn't mean that the compiler will actually inline [14:05.200 --> 14:11.160] it. So, it's partly structural information and partly metadata, so you see [14:11.160 --> 14:16.960] there's no clear line between those; they blur at some points, but we can represent [14:16.960 --> 14:23.520] all of them in the metadata-annotated call graph. And, as I said, we are able to do dynamic [14:23.520 --> 14:29.000] augmentation. Well, what is dynamic augmentation? If you remember, each object contains the [14:29.000 --> 14:35.720] call graph that we generated, which means that the call graphs can be aggregated at runtime [14:35.720 --> 14:40.360] when a shared library is loaded, because even if you are at link time, even if you can see [14:40.360 --> 14:45.160] all the statically linked objects, all the translation units that belong to your target, [14:45.160 --> 14:51.800] your binary might load a shared library, which, well, you're unaware of. So, the idea [14:51.800 --> 14:57.840] is: as soon as the main executable is loaded, it passes its embedded call graph on startup [14:57.840 --> 15:03.120] to a runtime collecting library. Then, the main executable can load whatever shared [15:03.120 --> 15:10.320] object it wants. If this shared object also contains a call graph, then this runtime [15:10.320 --> 15:15.480] collector gets passed that call graph on the first load of the shared object and can aggregate [15:15.480 --> 15:20.200] it by merging. So, we're basically back to the translation-unit-by-translation-unit [15:20.200 --> 15:27.080] based approach, but now we're doing the merging at the level of shared libraries, binaries, and executables. [15:27.080 --> 15:31.880] Then, we attach all this data together into one really big whole-program call graph, [15:31.880 --> 15:36.440] now including shared objects.
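(Editor's note: a small sketch of the dynamic augmentation idea described above. All names here are assumptions for illustration and not the actual CaGe/CaPI runtime interface: the serialized call graph is embedded in a dedicated section of each binary or shared object, and a constructor function hands it to a runtime collector when the object is loaded, whether at program startup or via dlopen, so the collector can merge the pieces into one whole-program graph.)

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical runtime-collector entry point. In the real setup this would
// live in the measurement runtime and merge the passed fragment into a
// global whole-program call graph; here it only prints the size.
extern "C" void cg_runtime_register(const char *serialized_cg, std::size_t size) {
  std::fprintf(stderr, "registering call graph fragment (%zu bytes)\n", size);
  (void)serialized_cg;
}

// Call graph embedded at (link) time, e.g. placed into a custom ELF section.
__attribute__((section(".cg_embedded")))
static const char embedded_call_graph[] = "{ \"functions\": [] }";

// Runs automatically when this executable or shared object is loaded
// (before main(), or on dlopen()) and hands its graph to the collector.
__attribute__((constructor))
static void register_embedded_call_graph() {
  cg_runtime_register(embedded_call_graph, sizeof(embedded_call_graph));
}
```

Because the constructor runs on first load, libraries that are never loaded never contribute, and ones loaded later simply extend the merged graph.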
And then we can export this, for example, and pass it [15:36.440 --> 15:39.960] back to CaPI for further refinement of the instrumentation. [15:39.960 --> 15:54.880] Go ahead. All right. Thanks. So, to put it all together, for CaPI we have the call [15:54.880 --> 16:00.040] graph analysis approach Tim just explained. For each object file, we have the call [16:00.040 --> 16:06.800] graph analysis and then the embedding. Then, on the runtime side, we can [16:06.800 --> 16:13.040] merge the main executable's call graph with those of the shared libraries as they are loaded. And [16:13.040 --> 16:19.320] we defer the instrumentation, using a dynamic instrumentation approach, in order to [16:19.320 --> 16:30.120] apply that selection dynamically. And, yeah, that's how it works. So, to summarize, [16:30.120 --> 16:37.520] we are developing the CaPI tool for instrumentation selection based on call graph analysis. [16:37.520 --> 16:44.560] We have explored the CaGe plugin for generating this call graph information at link time, [16:44.560 --> 16:50.440] which allows whole-program visibility and this dynamic augmentation. Together [16:50.440 --> 16:57.120] with CaPI, we can use the embedded call graph to run the selection at runtime [16:57.120 --> 17:07.760] and thereby improve CaPI by making the compilation and analysis process more streamlined. [17:07.760 --> 17:13.640] This is under active development, so at this point we don't have a very detailed [17:13.640 --> 17:20.000] evaluation of performance and things like that. There are some concerns; for example, [17:20.000 --> 17:27.240] when you go to very big programs and do LTO, there might be performance problems. So, there [17:27.240 --> 17:37.720] might be more work needed to make it viable in that regard. But, yeah, it works well in a prototype [17:37.720 --> 17:44.720] fashion. Yeah. Thank you. [17:44.720 --> 17:53.240] Any questions? [17:53.240 --> 18:07.320] Perhaps I was a bit distracted. So, I have two questions. If you can comment on lambda [18:07.320 --> 18:19.160] functions, on OpenMP sections of the code, not specific constructs, and on the instruction [18:19.160 --> 18:27.360] cache. If you can comment on how those are handled, those three aspects, let's say. [18:27.360 --> 18:34.280] So, when it comes to OpenMP, we don't at the moment have any specific OpenMP profiling [18:34.280 --> 18:41.440] interface that we target. This might be something that is probably useful in the future. [18:41.440 --> 18:49.360] But for now, we just do the basic function instrumentation and select on that. Yeah. Then the [18:49.360 --> 18:55.720] other point was, sorry, could you repeat? [18:55.720 --> 18:59.480] So, do you want to comment on that? [18:59.480 --> 19:05.840] So, lambdas and caches: regarding caching, no, there is no logic to handle caching in any way. [19:05.840 --> 19:11.200] And regarding lambdas and OpenMP, if you're talking about profiling, you've got [19:11.200 --> 19:16.120] your answer right there.
But if you're talking about generating call graphs in which lambdas [19:16.120 --> 19:23.680] and OpenMP runtime calls are visible, then the call graph will actually figure out that [19:23.680 --> 19:32.800] there are OpenMP runtime calls and, if I remember correctly, will figure out that [19:32.800 --> 19:38.400] a given function calls back into the OpenMP runtime, because once you are in the IR, the runtime calls are actually [19:38.400 --> 19:43.800] spelled out; all the pragmas have been removed from the actual source code. So, we are aware [19:43.800 --> 19:48.680] of OpenMP, but we are not currently using that information for any profiling. But you could [19:48.680 --> 19:55.320] do metadata-based CaPI selection with it. So, every call path that eventually leads [19:55.320 --> 20:03.080] to an OpenMP runtime call would be a valid instrumentation target using CaPI. [20:03.080 --> 20:08.360] The question about the instruction cache is whether, and this is my ignorance, [20:08.360 --> 20:14.640] if you are introducing a lot of new instructions here into the code that is being [20:14.640 --> 20:22.280] run, I was wondering whether too much data ends up there; maybe I didn't understand well. [20:22.280 --> 20:27.520] So, this is more of a performance-related question, right? Okay, so, yes, of course, [20:27.520 --> 20:31.840] you introduce new instructions whenever a shared library is loaded, because we then [20:31.840 --> 20:37.840] add instructions that pass the call graph back to our runtime collecting facility, and [20:37.840 --> 20:42.920] we also introduce instructions because we are using a profiling approach, which means function [20:42.920 --> 20:48.200] calls. So, yes, we are interfering with the instruction fetching and caching flow, which [20:48.200 --> 20:54.000] is also why profiling has rather high overhead compared to sampling approaches, for example. [20:54.000 --> 21:00.840] But as Sebastian said, we have not really extensively profiled our own application, which is quite [21:00.840 --> 21:13.960] ironic, so we are not aware of how big the benefit or the impact actually would be. [21:13.960 --> 21:20.280] So, on your slide you say you have a fork of LLD. The obvious question is: what's stopping [21:20.280 --> 21:26.280] you from upstreaming this? So, yes, we have a fork of LLD that basically just [21:26.280 --> 21:34.080] exposes a flag to load a new-pass-manager plugin; you pass the plugin, and it does the rest. [21:34.080 --> 21:39.280] What's currently holding us back from upstreaming is that it's not very well developed, it was cobbled [21:39.280 --> 21:44.720] together in half a week or so, and there already is an open merge request on [21:44.720 --> 21:52.320] Phabricator that implements this exact functionality for, if I remember correctly, LLVM 9, which [21:52.320 --> 22:03.000] sat untouched for a year until it was totally abandoned and closed, and so we didn't actually [22:03.000 --> 22:09.280] figure out how to make this more interesting to the community. Don't take that as a signal. People just move [22:09.280 --> 22:14.440] jobs or whatever, so try it again and push people, and I can help you with that as well, [22:14.440 --> 22:18.280] finding the right people to get this in, because it seems like a simple and obvious thing to [22:18.280 --> 22:24.640] have. It isn't actually that hard. Apparently, there wasn't much interest in the community [22:24.640 --> 22:37.000] back in 2020? Yes, that's too long ago.
We will polish it a little more, and then hopefully [22:37.000 --> 22:48.760] get this one upstream. Thanks. Any other questions? No. Okay. Thank you very much. [22:48.760 --> 23:07.720] Thank you very much, Sebastian.