[00:00.000 --> 00:16.440] Yeah, yeah, exactly. Okay, good afternoon. Yeah, so I'm going to be talking about compiler [00:16.440 --> 00:24.000] intrinsics in SYCL, in DPC++ specifically. This is Intel's open-source SYCL implementation. [00:24.000 --> 00:29.360] This is what I work on. Yeah, so hopefully I'll be able to say something without saying [00:29.360 --> 00:35.880] too much in 10 minutes. Yeah, so Codeplay. I work for Codeplay. We had the first SYCL [00:35.880 --> 00:42.640] implementation, ComputeCpp. We were acquired by Intel, so now we work [00:42.640 --> 00:47.640] on the Intel SYCL implementation, DPC++. That's what I work on. We have lots [00:47.640 --> 00:51.560] of partners, you know, hardware companies, that kind of thing; whoever needs an OpenCL [00:51.560 --> 00:55.160] implementation, a SYCL implementation, and so on, comes to us. [00:55.160 --> 01:01.760] Yeah, so SYCL is a single-source heterogeneous programming API. So you can write single-source [01:01.760 --> 01:11.640] code that can run on NVIDIA, Intel, and AMD GPUs. Closer to the mic? Okay, voice up. Yeah, so [01:11.640 --> 01:16.600] it's great for someone who's developing scientific applications to be able to write single-source [01:16.600 --> 01:25.760] code that runs on whatever GPU the implementation enables, such as CUDA, Level Zero for Intel, [01:25.760 --> 01:31.280] AMD GPUs, and so on. Yeah, this is a really good thing. So I work specifically on the [01:31.280 --> 01:38.520] NVIDIA and the HIP, the AMD, backends for DPC++. Okay, so yeah, I just want to talk [01:38.520 --> 01:42.640] a little bit about compiler intrinsics and how, kind of, you know, math function calls [01:42.640 --> 01:47.680] work in SYCL and DPC++ at the moment, and how we can hopefully improve them so that [01:47.680 --> 01:53.120] we're contributing upstream. So what happens to sycl::cos? So essentially, you get your [01:53.120 --> 01:58.800] sycl::cos in your source code. This is redirected to __spirv_ocl_cosf, you [01:58.800 --> 02:03.200] compile the SPIR-V, you make a SPIR-V module, this is a symbol within the SPIR-V module, [02:03.200 --> 02:09.680] and then the implementation is provided by an OpenCL, Level Zero, or Vulkan driver. Okay, as [02:09.680 --> 02:16.720] I said, I don't work on the SPIR-V backend at all. I work on the PTX, the CUDA, or the AMD [02:16.720 --> 02:21.440] GPU backends. So what do we do with these symbols so that we get to the native implementations? [02:21.440 --> 02:25.360] We're not trying to reinvent the wheel. We're not trying to do anything that the people [02:25.360 --> 02:30.560] who are making the GPUs aren't doing already. We're just trying to redirect to that. So [02:30.560 --> 02:36.840] how do we go from this to that, and then compile to our PTX module, our AMDGPU module, HSA [02:36.840 --> 02:48.320] module, and so on. So, yeah, how do we go from __spirv_ocl_cosf to __nv_cosf? Use a [02:48.320 --> 02:52.880] shim library, easy peasy, that's fine. Okay, you just redirect it, you compile it to bitcode, [02:52.880 --> 02:57.440] you link it at compilation time, and you get to this native bitcode implementation. [02:57.440 --> 03:03.320] This is great. Okay, so we use libclc for this. So libclc is written in OpenCL. Okay, OpenCL [03:03.320 --> 03:08.520] does lots of stuff that SYCL doesn't expose as easily, like address spaces, that kind of [03:08.520 --> 03:13.560] thing. So we write in OpenCL. This is great.
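To make the single-source model concrete, here is a minimal SYCL sketch (assuming a device with USM support). The sycl::cos call in the kernel is what gets redirected to __spirv_ocl_cosf for SPIR-V targets, and what has to be rerouted to the vendor math library for the PTX and AMDGPU backends:

```cpp
// Minimal single-source SYCL example: the same code can be compiled for the
// SPIR-V (OpenCL / Level Zero), CUDA, or HIP backends of DPC++.
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;                                        // whatever device the runtime picks
  float *data = sycl::malloc_shared<float>(1024, q);    // assumes the device supports USM

  q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
    // For SPIR-V targets this call ends up as the __spirv_ocl_cos symbol that the
    // OpenCL / Level Zero / Vulkan driver implements; for PTX or AMDGPU it has to
    // be redirected to the vendor's bitcode math library instead.
    data[i] = sycl::cos(static_cast<float>(i[0]));
  }).wait();

  sycl::free(data, q);
  return 0;
}
```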
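A libclc-style shim for the cosine case looks roughly like the following. This is a simplified sketch in plain C++ rather than the OpenCL C that libclc actually uses, and the unmangled symbol names are assumed for illustration:

```cpp
// Rough sketch of the shim idea. Real libclc is written in OpenCL C and the
// real symbols are mangled; the plain names below are assumed for illustration.
extern "C" float __nv_cosf(float);            // implementation lives in NVIDIA's libdevice bitcode

extern "C" float __spirv_ocl_cosf(float x) {  // symbol the SYCL front end emits for sycl::cos
  return __nv_cosf(x);                        // just forward; after BC linking this all gets inlined
}
```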
This makes our lives really, really easy. [03:13.560 --> 03:20.600] We can do it. So, before we get into this, just why do we want to use a BC library in [03:20.600 --> 03:24.520] the first place? Why don't we use a .so? Why don't we just resolve to some symbol that [03:24.520 --> 03:30.640] is then called at runtime, and we don't care about it? So on a GPU, the overhead of a function [03:30.640 --> 03:35.000] call is really high. Okay, it's because we lose information about, say, address spaces, [03:35.000 --> 03:40.560] that kind of thing. The GPU memory hierarchy is a bit more complex than, say, a CPU's. [03:40.560 --> 03:44.120] So we really, really need to worry about this. We want to inline everything so we don't lose [03:44.120 --> 03:50.720] any information about our memory hierarchies. We also allow compile-time branching of code [03:50.720 --> 03:53.800] based on the architecture, based on the backend, that kind of thing. We don't want to [03:53.800 --> 03:57.000] have these checks at runtime. We want high performance. That's the name of the game for [03:57.000 --> 04:03.000] what we're doing. This gives us greater optimization opportunities as well. You can do lots of [04:03.000 --> 04:08.360] dead code elimination, lots of fun stuff in the middle end, because you're doing all these [04:08.360 --> 04:15.200] checks at the IR level. Okay, so this is just kind of what it looks like. So we just have [04:15.200 --> 04:21.520] __spirv_ocl_cosf, and we return __nv_cosf. Great. Amazing. That's so easy. And then this is [04:21.520 --> 04:27.640] the implementation which is provided by NVIDIA. This is in bitcode. We link this, and then [04:27.640 --> 04:36.560] this is just inlined into our IR. This is great. Okay. Yes, so we're linking the SYCL [04:36.560 --> 04:41.640] code with libclc. Then we link that with the vendor-provided BC library. So we're linking, [04:41.640 --> 04:48.480] linking. We get to the implementation. It's all inlined. It's all great. We love it. [04:48.480 --> 04:55.000] So this works well, but here's a bit of code from libclc. Because we're dealing [04:55.000 --> 04:59.160] in OpenCL C, we could choose something else, we could write in native IR, but we find that OpenCL [04:59.160 --> 05:04.520] is actually easier to use and easier to maintain than writing native IR. Still, we end [05:04.520 --> 05:09.880] up with some funny kinds of problems with mangling and all this kind of thing. This isn't nice. [05:09.880 --> 05:15.440] Sometimes we need manual mangling. It has to do with namespaces when they're interpreted [05:15.440 --> 05:23.640] by the OpenCL mangler, unfortunately. Yes, and sometimes OpenCL [05:23.640 --> 05:27.320] isn't as good as we want it to be, so we need to actually write native IR as well. So [05:27.320 --> 05:37.120] it's a mix of LLVM IR and OpenCL in libclc. It's a bit messy. It's not great. Yes, so also we're [05:37.120 --> 05:42.560] exposing some compiler internals here. This is the NVVM reflect pass, which essentially [05:42.560 --> 05:46.800] takes your function call to __nvvm_reflect and replaces it with a numeric value. This is [05:46.800 --> 05:53.920] all done at the IR level, so you can branch at the IR level based on: this is a higher, [05:53.920 --> 06:00.000] a newer architecture, so use this newer implementation, this newer built-in; this is an older architecture, [06:00.000 --> 06:04.640] so use the older one. This pass is also used for things like rounding modes.
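The kind of source-level use of __nvvm_reflect being described is roughly this (the query string, the threshold, and the choice of libdevice built-ins are illustrative, not taken from libclc):

```cpp
// Sketch of how __nvvm_reflect is used from source. The NVVMReflect pass
// replaces the call with a constant at the IR level, so the dead branch is
// eliminated; the query string, the threshold, and the pairing of libdevice
// built-ins here are purely illustrative.
extern "C" int   __nvvm_reflect(const char *);
extern "C" float __nv_expf(float);        // libdevice: accurate expf
extern "C" float __nv_fast_expf(float);   // libdevice: faster, lower-accuracy expf

extern "C" float shim_expf(float x) {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)   // "newer architecture?" resolved in the middle end
    return __nv_expf(x);
  return __nv_fast_expf(x);
}
```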
We're exposing this in source [06:04.640 --> 06:13.920] code through hacks. This isn't really, you know, it's not kosher. But it works. [06:13.920 --> 06:20.720] Who cares? Okay, but consider the new proposal to add FP accuracy attributes to math built-ins. [06:20.720 --> 06:26.680] This is where we have, say, FP built-in calls, and we specify the accuracy in ULP that we [06:26.680 --> 06:32.880] want them to be computed to. Okay, this is totally lost on us. Okay, so this is what it would [06:32.880 --> 06:38.160] look like. Yeah, you have this attribute here, the FP max error. This is really, really [06:38.160 --> 06:44.000] needed in SYCL, because SYCL is targeting lots and lots of different platforms. All these platforms [06:44.000 --> 06:48.320] have different numerical accuracy guarantees. We really, really need this, but we don't use [06:48.320 --> 06:55.040] the built-ins at all; sorry, we don't use LLVM intrinsics at all. So we need to get to [06:55.040 --> 06:58.800] a point where we can start using this compiler infrastructure. We're not using it as much as [06:58.800 --> 07:07.920] we want to. So we could do this using a libclc compiler kind of hack workaround. We do another, [07:07.920 --> 07:12.000] you know, pass; you just check some compiler precision value. If it's one thing, do some precise square root. [07:12.640 --> 07:17.520] If it's not, do some approximate thing. Yeah, we could do that. Okay, the problem with libclc [07:17.520 --> 07:23.680] and this stuff is that it's not upstreamable. Okay, it's a collection of hacks. It's not totally [07:23.680 --> 07:29.040] hacked, but it's a little bit messy. It's not written in one language: it's OpenCL [07:29.040 --> 07:33.760] and it's LLVM IR. It's messy. If we can upstream this, we can all benefit from it. [07:37.120 --> 07:43.200] Okay, so the pro of adding another hack to the pile of passes, [07:43.200 --> 07:47.440] another hack to the bunch, is that it's easy to do. Okay, we can do this and we can keep going with our [07:48.080 --> 07:52.080] libclc implementation. It's pretty straightforward. We've been doing this the whole time. Yeah, [07:52.080 --> 07:58.240] fine. We don't need to worry about the broader LLVM concerns. However, we miss out on LLVM community [07:58.240 --> 08:03.040] collaboration, which is why we're here. And then how many of these workarounds do we [08:03.040 --> 08:08.720] actually need in order to keep up with the latest trends? And then libclc, bad as it may be now, [08:08.720 --> 08:11.920] just degenerates into an absolute mess, and we don't want that. [08:14.320 --> 08:20.240] Okay, so we think the answer for this is to try and redirect, try and actually [08:20.240 --> 08:24.480] have it calling the compiler intrinsic. Okay, we want to use compiler intrinsics and then have [08:24.480 --> 08:30.080] some generic behavior of these intrinsics for offload targets. Okay, and this would be used by, [08:30.080 --> 08:35.520] say, OpenMP, by, you know, CUDA Clang and so on, all these different targets. But we don't have [08:35.520 --> 08:41.360] this transformation; we're not comfortable with this connection, okay, from an intrinsic to a [08:41.360 --> 08:47.280] vendor-provided BC built-in. Okay, why is that? Essentially, this needs to happen as early as [08:47.280 --> 08:54.640] possible at the IR level. So we're adding an external dependency in our LLVM, kind of, [08:54.640 --> 09:01.120] you know, pipeline.
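The kind of precision-value workaround mentioned a moment ago (branching on some compiler-provided precision value) might look something like this; every name prefixed __clc here is hypothetical and only meant to show the shape of such a hack:

```cpp
// Hypothetical sketch of such a workaround: a magic query that yet another
// custom pass would fold to a constant, used to pick a precise or an
// approximate implementation. __clc_required_ulp_error is made up; the two
// libdevice calls are real, but the pairing is only for illustration.
extern "C" int   __clc_required_ulp_error();    // hypothetical, folded by a custom pass
extern "C" float __nv_sqrtf(float);             // libdevice: IEEE-rounded sqrtf
extern "C" float __nv_fast_powf(float, float);  // libdevice: fast, approximate powf

extern "C" float shim_sqrtf(float x) {
  if (__clc_required_ulp_error() == 0)
    return __nv_sqrtf(x);                       // correctly rounded path
  return __nv_fast_powf(x, 0.5f);               // relaxed-accuracy path (illustrative)
}
```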
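The intrinsic-based direction, by contrast, would mean ordinary LLVM math intrinsics showing up in the IR and being resolved against the vendor BC library early in the pipeline. A minimal sketch of the source side, assuming compilation with -fno-math-errno so that clang emits the llvm.cos.f32 intrinsic:

```cpp
// Sketch of the intrinsic-based direction (not current DPC++ behaviour):
// compiled with -fno-math-errno, clang typically lowers __builtin_cosf to the
// llvm.cos.f32 intrinsic. The idea is that, for offload targets, such
// intrinsics would then be resolved early in the IR pipeline by linking the
// vendor BC library (e.g. libdevice) and mapping them onto __nv_cosf there.
extern "C" float device_cos(float x) {
  return __builtin_cosf(x);   // becomes llvm.cos.f32 in the IR under the assumption above
}
```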
We need to link this BC library early on in our, yeah, pipeline. [09:01.840 --> 09:05.440] We don't do this. We're not comfortable with doing this. We need to figure out a way that people [09:05.440 --> 09:11.520] will be happy with us doing this. Okay, obviously we're used to things resolving to external symbols, [09:11.520 --> 09:16.480] but then that's a runtime thing, not a compile-time thing. Okay, this needs to be [09:16.480 --> 09:24.800] inlined. We need to do lots and lots of stuff with this at the IR level. Okay, so there will still be [09:24.800 --> 09:28.640] cases where we need libclc, potentially. It's not going to, you know, just disappear from our [09:28.640 --> 09:37.360] SYCL implementation, hopefully, but we need to start pushing towards better kind of resolution, [09:37.360 --> 09:43.040] better use of these intrinsics in LLVM for offload in general. Okay, so why? [09:43.040 --> 09:48.960] Why? Shared infrastructure, keeping on the cutting edge of new developments, [09:48.960 --> 09:54.160] fewer compiler hacks, and we make SYCL compilation eventually work upstream. It doesn't at the [09:54.160 --> 09:58.240] moment, but eventually we want it to, of course. We're trying to upstream as much as possible, [09:58.240 --> 10:00.400] but libclc is not upstreamable, and that's a problem. [10:02.880 --> 10:07.920] Okay, so the first step: try and have this discussion about making the intrinsics work [10:07.920 --> 10:14.800] for offload. Okay, so time, okay, time's up. So we need to have this link step at the IR level, [10:14.800 --> 10:19.440] early on in the IR kind of pipeline. This is problematic for some people, but we need to talk [10:19.440 --> 10:26.720] about this. So please join in the discussion here. This is the "NVPTX codegen for llvm.sin and friends" thread, [10:26.720 --> 10:31.440] if you have any opinions on this. Sorry, I kind of ran over a little bit, but yeah, any questions? [10:31.440 --> 10:41.680] Yeah, I was wondering, would it make sense to try to get rid of the mess by going to an MLIR [10:41.680 --> 10:46.960] type of approach, or, like, what are the benefits or downsides of MLIR? [10:47.680 --> 10:54.640] So I'm not an expert. So the question was, are there benefits, can we avoid this by going to MLIR? [10:54.640 --> 11:03.360] I think it's more fundamental than MLIR. I'm not an expert on MLIR, but I think we need basic [11:03.360 --> 11:09.600] resolution of intrinsics. Presumably with MLIR, you'll have, you know, other MLIR intrinsics that [11:09.600 --> 11:13.520] will need the same kind of treatment. We'll have the same questions there. So this is the first [11:14.640 --> 11:17.600] case study. This is the most simple case. We're not trying to implement the new [11:18.320 --> 11:22.720] FP built-ins with the accuracy thing. We're just trying to decide how we make this dependency [11:22.720 --> 11:29.360] on this external BC lib work, and do it in a very, very confined sort of way. Yeah, thank you. [11:30.400 --> 11:38.480] Yeah. Two questions. The first one is, in the tutorial for generating NVPTX from MLIR, there is a whole section [11:38.480 --> 11:42.960] about linking with the bitcode library from NVIDIA. So what's the difference with this? [11:43.520 --> 11:48.960] And the second question is, you mentioned NVVM, which is the closed-source [11:48.960 --> 11:54.160] PTX generator from NVIDIA, and there is also the LLVM NVPTX backend. [11:56.000 --> 12:00.400] Are we reaching speed parity with the closed-source one?
[12:01.760 --> 12:05.280] It depends on the application. We find that, so, the second question first: [12:06.400 --> 12:14.320] is there still a big performance gap between the native, say, NVCC compiler and LLVM Clang? So [12:14.320 --> 12:22.960] in terms of DPC++, which is a fork of LLVM, we're attaining, say, roughly comparable performance [12:23.600 --> 12:29.920] whether you're using SYCL or you're using CUDA with NVCC, and then any improvements that we make to [12:29.920 --> 12:36.560] the kind of compiler or whatever, they're shared by Clang CUDA as well. So the first question again [12:36.560 --> 12:47.600] was, how is this different from that? So essentially, when you're linking bitcode like that, [12:48.640 --> 12:56.480] you're not using any LLVM intrinsics. You're just redirecting things yourself. You're not [12:56.480 --> 13:02.800] using intrinsics, so you need to do everything explicitly. You need to either have a specific [13:02.800 --> 13:08.720] kind of driver path that will do this for you, or you need to specifically say, I want to link this [13:08.720 --> 13:14.000] in at this time, or whatever. And so it's more manual. It's not happening automatically. It's [13:14.000 --> 13:18.560] not happening really within the compiler. It's happening at link time, LLVM link time. [13:18.560 --> 13:31.360] All right. Thank you, Hugh. Thank you.