[00:00.000 --> 00:08.800] We'll talk about AVX-512 in FFmpeg. [00:08.800 --> 00:12.720] He's also the co-organiser of this dev room. [00:12.720 --> 00:14.600] Please welcome Kieran. [00:14.600 --> 00:23.640] So yes, I'm going to be talking about AVX-512 in FFmpeg. [00:23.640 --> 00:25.600] What is AVX-512? [00:25.600 --> 00:28.480] AVX stands for Advanced Vector Extensions. [00:28.480 --> 00:31.200] There will be a lot of acronyms and jargon, unfortunately, [00:31.200 --> 00:35.320] in this one, but I will try and explain all of them. [00:35.320 --> 00:38.320] So AVX-512 is a relatively new single instruction, [00:38.320 --> 00:43.960] multiple data instruction set, on Intel CPUs from about 2017 [00:43.960 --> 00:48.280] and more recently, in the last six months or so, on AMD CPUs. [00:48.280 --> 00:53.240] In particular, it has a larger 512-bit register size, [00:53.240 --> 00:56.360] many new instructions, which we'll talk about in a minute, [00:56.360 --> 01:00.360] comparisons, which are quite new, and also lots of other things [01:00.360 --> 01:02.360] that are not so interesting in multimedia: [01:02.360 --> 01:06.360] cryptography, neural networks, and I'm sure there are other people [01:06.360 --> 01:10.360] at FOSDEM who could talk a lot more about these kinds of things. [01:10.360 --> 01:13.360] As I mentioned, lots of fancy words, but the thing to bear in mind [01:13.360 --> 01:16.360] is that in FFmpeg, high schoolers have gone and written assembly. [01:16.360 --> 01:19.360] This is heavily jargon-centric. [01:19.360 --> 01:22.360] It sounds complicated, but actually quite a reasonable chunk [01:22.360 --> 01:25.360] of the assembly in FFmpeg has been written by people who are in high school. [01:25.360 --> 01:28.360] Why is this relevant now? [01:28.360 --> 01:35.360] I've mentioned AVX-512 has been around since 2017, so why talk about it in 2023?
[01:35.360 --> 01:40.360] Well, Skylake was the first CPU generation from Intel to have AVX-512 support, [01:40.360 --> 01:44.360] but it had very large performance throttling when you used these instructions, [01:44.360 --> 01:50.360] so your effective CPU speed went down quite dramatically. [01:50.360 --> 01:55.360] And so this was fine if you were doing high-performance computing in academia, [01:55.360 --> 01:58.360] for example, like fluid dynamics, where you were using these instructions [01:58.360 --> 02:01.360] 100% of the time. [02:01.360 --> 02:04.360] But multimedia is a mixture of assembly and C code, [02:04.360 --> 02:07.360] where you're not necessarily always using these instructions. [02:07.360 --> 02:12.360] So this has remained sort of unused for the last couple of years. [02:12.360 --> 02:16.360] You could still use these new instructions, though, with the smaller register sizes, [02:16.360 --> 02:20.360] and I'll show an example of this later. [02:20.360 --> 02:27.360] But the first Intel CPUs not to have throttling were the Ice Lake 10th and 11th gen Intel CPUs. [02:27.360 --> 02:32.360] They were the first to have no throttling, and this meant these ZMM-based instructions [02:32.360 --> 02:36.360] could be first-class citizens. [02:36.360 --> 02:40.360] How to get started: one of the tricky things in the last few years [02:40.360 --> 02:44.360] has been actually getting access to devices that have this, [02:44.360 --> 02:47.360] and unfortunately Intel have not made it easy. [02:47.360 --> 02:52.360] From their 12th generation, they have actually removed support in consumer CPUs. [02:52.360 --> 02:56.360] It's still available on AMD Zen 4 CPUs, though. [02:56.360 --> 02:59.360] And if using the cloud is your kind of thing, [02:59.360 --> 03:02.360] it's also available from many cloud providers in the server CPU range, [03:02.360 --> 03:05.360] such as AWS or others.
[03:05.360 --> 03:08.360] Personally, I think the easiest way is to buy an 11th generation Intel NUC. [03:08.360 --> 03:10.360] That's what I did for FFmpeg. [03:10.360 --> 03:13.360] I bought two of them for the project and host them. [03:13.360 --> 03:16.360] It's the easiest way, it's only a few hundred euros. [03:16.360 --> 03:18.360] It's quiet, it fits under your desk. [03:18.360 --> 03:24.360] And that's the easiest way to get started, you get a full AVX-512 stack. [03:24.360 --> 03:30.360] So let's look at some of the existing work in multimedia that's using AVX-512. [03:30.360 --> 03:34.360] And probably most importantly, we had the sort of introduction from JB earlier today, [03:34.360 --> 03:36.360] the dav1d project, which is an AV1 decoder. [03:36.360 --> 03:42.360] This added AVX-512 support, I think a year or two ago. [03:42.360 --> 03:47.360] It's particularly beneficial in AV1 because AV1 has large block sizes, [03:47.360 --> 03:50.360] in comparison to more traditional standards, [03:50.360 --> 03:53.360] traditional codecs like H.264 and others, where they are smaller. [03:53.360 --> 03:57.360] So AVX-512 in dav1d gave, I think, 10 to 20% overall. [03:57.360 --> 03:59.360] So not just the functions themselves, [03:59.360 --> 04:03.360] the overall decode performance was improved. [04:03.360 --> 04:07.360] And it's actually been a running topic today, which is quite interesting, [04:07.360 --> 04:13.360] that in FFmpeg, and in dav1d, we use this classic FFmpeg/x264 approach to assembly, [04:13.360 --> 04:22.360] which is no intrinsics, no inline assembly, no special SIMD libraries to make life easier. [04:22.360 --> 04:28.360] It's raw assembly language, and I'll show some examples of that. [04:28.360 --> 04:33.360] And also we don't compile them in and force you to have a particular CPU generation. [04:33.360 --> 04:37.360] And I know this is quite controversial. I think it's MongoDB, for example.
[04:37.360 --> 04:44.360] One year they required a particular CPU generation, and this was super controversial because not everybody had that. [04:44.360 --> 04:49.360] So what we do in FFmpeg is we detect CPU capabilities, and I'll show you the function in a minute. [04:49.360 --> 04:52.360] And then we use function pointers, so we set them once at the beginning, [04:52.360 --> 04:57.360] and therefore the overhead of doing that check is paid once, [04:57.360 --> 05:01.360] and then there are function pointers that are executed after that. [05:01.360 --> 05:07.360] And unfortunately, on Intel, there's a very messy Venn diagram of capabilities. [05:07.360 --> 05:11.360] But in practice, so far, and they may change their mind, [05:11.360 --> 05:13.360] we really only care about these two things. [05:13.360 --> 05:17.360] So these are the CPU flags you get in FFmpeg. [05:17.360 --> 05:23.360] There are others, but the AVX-512-related ones are, broadly speaking, the legacy Skylake one [05:23.360 --> 05:26.360] and the newer ICL one, which I've put in bold, for Ice Lake. [05:26.360 --> 05:30.360] But you can see there are actually a lot of different subcategories in there. [05:30.360 --> 05:33.360] In practice, at the moment, it's one or the other, [05:33.360 --> 05:38.360] but as I mentioned, Intel are very keen on adding and removing features [05:38.360 --> 05:43.360] and possibly even charging you a subscription for certain features, which is one of their new ideas. [05:43.360 --> 05:48.360] So it could be that newer additions to this are subscription-based, [05:48.360 --> 05:52.360] or you buy and pay for it later, or something much more complicated. [05:52.360 --> 05:54.360] So who knows? [05:54.360 --> 06:00.360] So I guess, unfortunately, there's a sort of dependency here: [06:00.360 --> 06:03.360] I can't explain a few of the topics and some of the benefits [06:03.360 --> 06:05.360] without explaining some of the backstory.
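As a rough illustration of the detect-once, dispatch-through-function-pointers scheme described above, here is a minimal C sketch. Everything in it is hypothetical: the flag names, the `DSPContext` struct and the `pack_*` functions are invented for illustration and are not FFmpeg's actual identifiers, and the "optimised" variants are plain C stand-ins for what would be hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits (not FFmpeg's real flag names or values). */
enum { CPU_AVX2 = 1 << 0, CPU_AVX512 = 1 << 1, CPU_AVX512ICL = 1 << 2 };

typedef void (*pack_fn)(const uint8_t *src, uint8_t *dst, size_t n);

/* Shared body so the stand-ins below stay short. */
static void copy_bytes(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Stand-ins: in a real project every variant except pack_c would be assembly. */
static void pack_c(const uint8_t *s, uint8_t *d, size_t n)         { copy_bytes(s, d, n); }
static void pack_avx2(const uint8_t *s, uint8_t *d, size_t n)      { copy_bytes(s, d, n); }
static void pack_avx512(const uint8_t *s, uint8_t *d, size_t n)    { copy_bytes(s, d, n); }
static void pack_avx512icl(const uint8_t *s, uint8_t *d, size_t n) { copy_bytes(s, d, n); }

typedef struct { pack_fn pack; } DSPContext;

/* Called once at start-up: the capability check is paid here, not per call.
 * Later tests override earlier ones, so the most capable variant wins. */
static void dsp_init(DSPContext *ctx, unsigned cpu_flags)
{
    assert(ctx != NULL);
    ctx->pack = pack_c;
    if (cpu_flags & CPU_AVX2)      ctx->pack = pack_avx2;
    if (cpu_flags & CPU_AVX512)    ctx->pack = pack_avx512;
    if (cpu_flags & CPU_AVX512ICL) ctx->pack = pack_avx512icl;
}
```

After `dsp_init(&ctx, flags)`, the hot loop only ever calls `ctx.pack(src, dst, n)`, so the same binary runs on any CPU and picks the best available variant at run time.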
[06:05.360 --> 06:12.360] So historically, in old AVX, you had the 256-bit registers, [06:12.360 --> 06:15.360] and these were split in practice into lanes. [06:15.360 --> 06:20.360] So in practice, you've got two 128-bit lanes, [06:20.360 --> 06:23.360] and instructions, broadly speaking, operated within these lanes. [06:23.360 --> 06:27.360] So if you ran an instruction, it worked on data within a lane, [06:27.360 --> 06:29.360] and it was actually quite difficult, [06:29.360 --> 06:33.360] possible, but difficult, to move data between these lanes. [06:33.360 --> 06:40.360] And one of the historical limitations of the existing AVX and AVX2 code that we have [06:40.360 --> 06:44.360] in FFmpeg is lane crossing and all sorts of trickery [06:44.360 --> 06:49.360] that essentially costs CPU cycles, [06:49.360 --> 06:53.360] that takes time, to compensate for the lanes. [06:55.360 --> 06:57.360] I have to talk a bit about k-mask registers as well. [06:57.360 --> 07:00.360] So AVX-512 has this new set of registers called k-masks, [07:00.360 --> 07:07.360] k0 to k7, and these allow elements of a destination register to remain unchanged. [07:07.360 --> 07:10.360] So, for example, underneath, you could have an addition, [07:10.360 --> 07:12.360] and it's a simple case, [07:12.360 --> 07:15.360] obviously you could just add zero, and it's unchanged, [07:15.360 --> 07:19.360] but you could actually use the k-mask to say, [07:19.360 --> 07:22.360] actually, I don't want the addition to be applied to these elements. [07:22.360 --> 07:24.360] I want this to be a pure pass-through, [07:24.360 --> 07:28.360] or you could even force some of the elements to zero if you wanted to. [07:28.360 --> 07:32.360] There's a specific, I think it's a flag, that lets you do that. [07:32.360 --> 07:35.360] And there's a whole set of new instructions to go and manipulate these k-mask registers, [07:35.360 --> 07:39.360] and certainly dav1d, in particular, makes good use of k-masks.
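To make the pass-through versus zeroing behaviour of k-masks concrete, here is a small scalar C model of a masked 32-bit addition over a 512-bit register (16 elements). It is only a sketch of the semantics, not how anyone would write the real thing; the function name `masked_add_u32` and the `zeroing` parameter are made up for this example. In assembly the two behaviours are written roughly as `vpaddd zmm0{k1}, zmm1, zmm2` (merge) and `vpaddd zmm0{k1}{z}, zmm1, zmm2` (zero).

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* a 512-bit register holds 16 32-bit elements */

/* Scalar model of a masked add.
 * Mask bit i set:   dst[i] = a[i] + b[i]
 * Mask bit i clear: dst[i] is kept (merge masking) or set to 0 (zero masking). */
static void masked_add_u32(uint32_t dst[LANES], const uint32_t a[LANES],
                           const uint32_t b[LANES], uint16_t k, int zeroing)
{
    assert(zeroing == 0 || zeroing == 1);
    for (int i = 0; i < LANES; i++) {
        if ((k >> i) & 1)
            dst[i] = a[i] + b[i];
        else if (zeroing)
            dst[i] = 0;
        /* else: merge masking, dst[i] passes through untouched */
    }
}
```

With `k = 0x0005` and `zeroing = 0`, only elements 0 and 2 receive the sum and every other element of `dst` keeps its previous value.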
[07:39.360 --> 07:44.360] So now that I've sort of explained some of the backstory, [07:44.360 --> 07:48.360] I think it's fair to say one of the most important instructions, [07:48.360 --> 07:53.360] if not the most important instruction, in multimedia is the shuffle. [07:53.360 --> 07:57.360] Also known as permutes, and there might be a technical difference [07:57.360 --> 07:58.360] between a shuffle and a permute. [07:58.360 --> 07:59.360] Someone might be able to correct me. [07:59.360 --> 08:01.360] There might be some mathematical difference, [08:01.360 --> 08:03.360] but these are the most important, [08:03.360 --> 08:06.360] or one of the most important, instructions in multimedia. [08:06.360 --> 08:09.360] And as you can see on the right, basically [08:09.360 --> 08:13.360] shuffles let you take various bits of data [08:13.360 --> 08:15.360] and rearrange them in any way that you want, [08:15.360 --> 08:21.360] duplicate them, as you can see, or even set individual elements to zero. [08:21.360 --> 08:26.360] And, for example, famously one use case of this [08:26.360 --> 08:28.360] is in the zigzag scan in FFmpeg, [08:28.360 --> 08:32.360] which groups larger coefficients in a block together. [08:32.360 --> 08:36.360] The way that that's done is via a zigzag scan. [08:36.360 --> 08:39.360] The thing about vpermb, which is the new AVX-512 instruction, [08:39.360 --> 08:41.360] is it lets you cross a lane. [08:41.360 --> 08:43.360] This wasn't something that was possible before. [08:43.360 --> 08:49.360] And as I'll show you later, this makes things substantially faster in many cases. [08:49.360 --> 08:52.360] pshufb is probably one of the most commonly used instructions [08:52.360 --> 08:54.360] in all of open source multimedia. [08:54.360 --> 08:58.360] If you do a git grep for pshufb, you know, [08:58.360 --> 09:01.360] your screen will be full of pshufb.
[09:01.360 --> 09:06.360] They're used everywhere in open source multimedia. [09:06.360 --> 09:08.360] pshufb had a kind of useful benefit [09:08.360 --> 09:10.360] that if you set the index to minus one, [09:10.360 --> 09:12.360] it would automatically do the zeroing out. [09:12.360 --> 09:14.360] With vpermb, this isn't the case. [09:14.360 --> 09:17.360] You have to actually use k-masks to do that. [09:17.360 --> 09:21.360] So that just makes things slightly more complicated. [09:21.360 --> 09:25.360] There are all sorts of other interesting permutes that AVX-512 offers. [09:25.360 --> 09:28.360] I think dav1d also, again, makes good use of this vperm2b, [09:28.360 --> 09:31.360] so you don't just have one set of data, [09:31.360 --> 09:33.360] you can actually permute from two different registers. [09:33.360 --> 09:36.360] So you could have i, j, k, et cetera, et cetera in a different register, [09:36.360 --> 09:39.360] and your output could be a mixture of both of those. [09:39.360 --> 09:43.360] So that's kind of interesting. [09:43.360 --> 09:45.360] Variable shifts. [09:45.360 --> 09:48.360] You now have variable shifts. [09:48.360 --> 09:52.360] So I've given the example of vpsrlvw, variable logical right shift, [09:52.360 --> 09:56.360] and vpsllvw, variable logical left shift. [09:56.360 --> 10:00.360] Big letter soup, quite confusing. [10:00.360 --> 10:04.360] In fact, when writing this slide, I misspelt the word shift. [10:04.360 --> 10:07.360] You can have a think about how that may have been spelt. [10:07.360 --> 10:10.360] Thankfully, that's the good thing about rehearsals, [10:10.360 --> 10:11.360] they pick this up. [10:11.360 --> 10:13.360] But this word soup is exceptionally confusing, [10:13.360 --> 10:16.360] both when writing slides and writing code, it seems.
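Going back to the byte shuffles discussed above, the difference between the older in-lane shuffle and the lane-crossing permute can be modelled in a few lines of scalar C. This is a sketch of the documented instruction semantics only; the function names `model_pshufb` and `model_vpermb` are invented for the example, and `dst` must not alias `src`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of (v)pshufb on a register of n bytes (16, 32 or 64).
 * Each output byte can only pick from its own 16-byte lane, using the low
 * 4 bits of the index. An index with the top bit set (for example -1)
 * writes zero instead. */
static void model_pshufb(uint8_t *dst, const uint8_t *src,
                         const uint8_t *idx, size_t n)
{
    assert(n % 16 == 0 && n <= 64);
    for (size_t i = 0; i < n; i++) {
        size_t lane_base = i & ~(size_t)15;
        dst[i] = (idx[i] & 0x80) ? 0 : src[lane_base + (idx[i] & 15)];
    }
}

/* Model of vpermb on a 512-bit register: the low 6 bits of each index
 * select any of the 64 source bytes, so lanes can be crossed freely.
 * There is no zeroing via the index; that needs a k-mask instead. */
static void model_vpermb(uint8_t dst[64], const uint8_t src[64],
                         const uint8_t idx[64])
{
    for (int i = 0; i < 64; i++)
        dst[i] = src[idx[i] & 63];
}
```

Feeding both models an index of 63 in every position shows the difference: `model_vpermb` broadcasts byte 63 everywhere, while `model_pshufb` can only reach byte 15 of each lane.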
[10:16.360 --> 10:21.360] So historically, to do variable shifts, [10:21.360 --> 10:23.360] so, obviously, just to step back, if you want to [10:23.360 --> 10:26.360] take a register and shift each element by a different amount, [10:26.360 --> 10:28.360] this was quite complicated. [10:28.360 --> 10:32.360] There are various bits of trickery, various idioms that people use [10:32.360 --> 10:34.360] to try and emulate that, but they had limitations. [10:34.360 --> 10:38.360] I think, for example, shifting by zero [10:38.360 --> 10:42.360] possibly wasn't allowed in one of the various bits of trickery. [10:42.360 --> 10:44.360] And so if you needed a zero shift, [10:44.360 --> 10:46.360] you had to do it a different way, et cetera, et cetera. [10:46.360 --> 10:50.360] But now you have this variable shift, and it's all usable. [10:50.360 --> 10:53.360] Equally on the left shift, the naive way of doing an emulated [10:53.360 --> 10:56.360] left shift is just to multiply, but these instructions [10:56.360 --> 10:58.360] are actually faster than the multiply, [10:58.360 --> 11:00.360] so there's still some benefit. [11:02.360 --> 11:05.360] vpternlogd: this is, I think, no presentation [11:05.360 --> 11:10.360] about AVX-512 could fail to mention vpternlogd. [11:10.360 --> 11:13.360] This instruction is literally a kitchen sink. [11:13.360 --> 11:16.360] It's quite remarkable what it can actually do. [11:16.360 --> 11:18.360] You can literally program a truth table [11:18.360 --> 11:20.360] within an individual instruction itself, [11:20.360 --> 11:24.360] and it could, in theory, replace up to eight different instructions. [11:24.360 --> 11:29.360] So you could do a whole presentation on vpternlogd. [11:29.360 --> 11:33.360] So I thought it would be best to try and pick one of the simplest ones, [11:33.360 --> 11:35.360] which is a ternary operation. [11:35.360 --> 11:41.360] So this is a bitwise equivalent to the C ternary operation.
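For the variable shifts described a little earlier, the following scalar C sketch models one 16-bit element of the native instructions alongside two multiply-based emulations. The emulations are an assumption about the kind of idiom the speaker has in mind (a low multiply for left shifts, a high multiply for right shifts); the talk does not say which tricks FFmpeg actually uses. All function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit element of a variable logical left shift (vpsllvw-style):
 * each element has its own count, and counts above 15 give 0. */
static uint16_t sllv16(uint16_t x, uint16_t n)
{
    return n > 15 ? 0 : (uint16_t)((uint32_t)x << n);
}

/* One 16-bit element of a variable logical right shift (vpsrlvw-style). */
static uint16_t srlv16(uint16_t x, uint16_t n)
{
    return n > 15 ? 0 : (uint16_t)(x >> n);
}

/* Left shift emulated by keeping the low 16 bits of x * 2^n. */
static uint16_t sll_via_mul(uint16_t x, uint16_t n)
{
    assert(n <= 15);
    return (uint16_t)((uint32_t)x * (1u << n));
}

/* Right shift emulated by keeping the high 16 bits of x * 2^(16-n).
 * A shift by 0 would need the multiplier 65536, which does not fit in
 * 16 bits, so this idiom only covers n = 1..15. */
static uint16_t srl_via_mulhi(uint16_t x, uint16_t n)
{
    assert(n >= 1 && n <= 15);
    uint16_t mult = (uint16_t)(1u << (16 - n));
    return (uint16_t)(((uint32_t)x * mult) >> 16);
}
```

The high-multiply version shows one way a "no shift by zero" restriction can arise, which may or may not be the exact limitation mentioned in the talk.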
[11:41.360 --> 11:44.360] So in each register, each bit is iterated through. [11:44.360 --> 11:49.360] And you can see, for example, with a one, the ternary operation: [11:49.360 --> 11:51.360] if that bit is set, choose this, otherwise this, [11:51.360 --> 11:53.360] and you can see the output of that is that. [11:53.360 --> 11:57.360] And so, essentially, it's an operation of ZMM [11:57.360 --> 12:00.360] is equal to ZMM0 ? ZMM1 : ZMM2, [12:00.360 --> 12:02.360] but on a bitwise level. [12:02.360 --> 12:05.360] And there are all sorts of other interesting things you can do, [12:05.360 --> 12:07.360] and this article is very good. [12:07.360 --> 12:11.360] It shows all sorts of interesting things you can do, [12:11.360 --> 12:14.360] bit selects, all sorts of various different operations [12:14.360 --> 12:18.360] that you can do, multiple XORs, for example. [12:18.360 --> 12:22.360] So, yeah, also very interesting. [12:22.360 --> 12:24.360] So let's look at a real-world example. [12:24.360 --> 12:26.360] I don't know how well you can see that. [12:26.360 --> 12:28.360] I was hoping the dark mode would actually make life easier, [12:28.360 --> 12:30.360] but maybe it's made things worse. [12:30.360 --> 12:33.360] But I'll point at some of it with the mouse. [12:33.360 --> 12:35.360] Is it the mouse? [12:35.360 --> 12:37.360] Because the mouse on the Mac is dark. [12:37.360 --> 12:41.360] But anyway, this is v210enc. [12:41.360 --> 12:43.360] It's probably one of the simplest assembly functions [12:43.360 --> 12:46.360] in FFmpeg, but what it does is it takes [12:46.360 --> 12:49.360] three 8-bit samples from different memory locations. [12:49.360 --> 12:52.360] As part of its work, it extends them to 10 bits [12:52.360 --> 12:57.360] and then packs those three 10-bit words into 32 bits.
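The packing step just described can be written out as scalar C for a single output word. This is a minimal sketch under stated assumptions: the first sample goes in the lowest 10 bits, 8-bit samples are widened to 10 bits by shifting left by two, and any clipping of sample values that a real encoder might apply is left out. The names `v210_pack3_8bit` and `v210_unpack` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack three 8-bit samples into one 32-bit word as three 10-bit fields.
 * Field 0 occupies bits 0..9, field 1 bits 10..19, field 2 bits 20..29;
 * the top two bits stay zero. Widening 8 -> 10 bits is a shift by 2. */
static uint32_t v210_pack3_8bit(uint8_t s0, uint8_t s1, uint8_t s2)
{
    uint32_t w = 0;
    w |= (uint32_t)s0 << 2;         /* (s0 << 2) in field 0 */
    w |= (uint32_t)s1 << (2 + 10);  /* (s1 << 2) in field 1 */
    w |= (uint32_t)s2 << (2 + 20);  /* (s2 << 2) in field 2 */
    return w;
}

/* Read back 10-bit field i (0, 1 or 2) from a packed word. */
static uint32_t v210_unpack(uint32_t w, int i)
{
    assert(i >= 0 && i < 3);
    return (w >> (10 * i)) & 0x3FF;
}
```

For example, `v210_pack3_8bit(1, 2, 3)` yields `0x00C02004`, and `v210_unpack` on that word returns 4, 8 and 12 for fields 0, 1 and 2.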
[12:57.360 --> 13:00.360] So what's interesting in this function is [13:00.360 --> 13:02.360] we're already starting to do lane crossing [13:02.360 --> 13:04.360] that wasn't possible before. [13:04.360 --> 13:08.360] So we load the Y samples, so the luma samples, [13:08.360 --> 13:11.360] into the lower 256 bits. [13:11.360 --> 13:14.360] We load the U section of the chroma into the third, [13:14.360 --> 13:18.360] or the second, if zero-indexed, portion of the register, [13:18.360 --> 13:23.360] and then equally the same for V. [13:23.360 --> 13:27.360] And then we do one, excuse me, [13:27.360 --> 13:33.360] and then one single vpermb [13:33.360 --> 13:35.360] can rearrange all of that in one go. [13:35.360 --> 13:40.360] This was a lot more complicated back in the olden days. [13:40.360 --> 13:43.360] pmaddubsw is some trickery [13:43.360 --> 13:45.360] that unfortunately there's not going to be enough time [13:45.360 --> 13:48.360] to explain, but essentially it is a multiply and add, [13:48.360 --> 13:50.360] and we use that to emulate a shift. [13:50.360 --> 13:54.360] And then for the second element [13:54.360 --> 13:58.360] of the three elements, we need to do a dword shift [13:58.360 --> 14:01.360] because it actually spans the middle. [14:01.360 --> 14:05.360] So therefore we then have sort of conflicting bits [14:05.360 --> 14:06.360] in each register. [14:06.360 --> 14:07.360] So how do we do a bit selection? [14:07.360 --> 14:09.360] And this was, I think, [14:09.360 --> 14:14.360] around two or three different instructions [14:14.360 --> 14:15.360] in the previous code. [14:15.360 --> 14:19.360] And this can now be done in a single vpternlogd, [14:19.360 --> 14:23.360] so essentially c ? b : a. [14:23.360 --> 14:26.360] So if bit c is set, choose the bit from b, [14:26.360 --> 14:28.360] or choose it from a otherwise.
[14:28.360 --> 14:32.360] And you'll see in a second that actually provides quite a big, [14:32.360 --> 14:36.360] well, certainly a measurable speed improvement. [14:36.360 --> 14:37.360] So these are the benchmarks. [14:37.360 --> 14:41.360] So I wanted to show a bit about how you can [14:41.360 --> 14:44.360] get benefits from AVX-512 even on the older hardware [14:44.360 --> 14:46.360] with the shorter existing registers. [14:46.360 --> 14:48.360] These are not scientifically benchmarked, [14:48.360 --> 14:50.360] I just ran them yesterday. [14:50.360 --> 14:52.360] When you do benchmarking you should run them [14:52.360 --> 14:54.360] tens or hundreds of times, average them, [14:54.360 --> 14:56.360] do standard deviations, et cetera. [14:56.360 --> 15:00.360] But just for the simple case, [15:00.360 --> 15:05.360] you can see that the AVX2 code versus the C code [15:05.360 --> 15:06.360] is around 10 times faster. [15:06.360 --> 15:08.360] And you can see just by replacing, [15:08.360 --> 15:10.360] I think it's a set of two or three different pands [15:10.360 --> 15:13.360] or various boolean functions, [15:13.360 --> 15:18.360] you can get a measurable increase just with one instruction [15:18.360 --> 15:23.360] replacing three, even on the older YMM registers. [15:23.360 --> 15:26.360] But where the big gains come is on Ice Lake. [15:26.360 --> 15:34.360] You can see the C code versus the AVX-512 ICL, [15:34.360 --> 15:35.360] there's a huge difference. [15:35.360 --> 15:39.360] So by using vpermb and the ZMM registers, [15:39.360 --> 15:43.360] you can already be twice as fast as the legacy AVX-512. [15:43.360 --> 15:45.360] And if something was 10 times faster, [15:45.360 --> 15:47.360] that now becomes 20 times faster. [15:47.360 --> 15:50.360] And I often have to say that's not a multiply, [15:50.360 --> 15:51.360] that's a times. [15:51.360 --> 15:53.360] So it's a massive improvement.
[15:53.360 --> 15:56.360] This was code that could, if you have a large resolution file, [15:56.360 --> 15:58.360] take up an entire CPU core, [15:58.360 --> 16:01.360] and now it takes essentially 5% of a core. [16:01.360 --> 16:05.360] It's really tiny. [16:05.360 --> 16:08.360] What AVX-512 code is next? [16:08.360 --> 16:11.360] Anything really that's line-based or frame-based, [16:11.360 --> 16:13.360] such as filtering or scaling. [16:13.360 --> 16:17.360] I think the next thing we're working on is deinterlacing. [16:17.360 --> 16:18.360] Anything involving comparisons: [16:18.360 --> 16:20.360] I haven't really talked about comparisons, [16:20.360 --> 16:24.360] but there are bits of code that often need to do comparisons. [16:24.360 --> 16:27.360] That's going to be an obvious place for AVX-512. [16:27.360 --> 16:30.360] Lots of places do triple booleans, [16:30.360 --> 16:34.360] multiple XORs, or multiple XORs and ANDs, [16:34.360 --> 16:37.360] and I think it's almost always possible [16:37.360 --> 16:40.360] to replace that with a vpternlogd. [16:40.360 --> 16:42.360] Likewise in the code base, [16:42.360 --> 16:44.360] there are various different idioms and trickery [16:44.360 --> 16:47.360] to try and emulate a variable left shift and right shift, [16:47.360 --> 16:49.360] multiplies for the left shifts [16:49.360 --> 16:51.360] and trickery for the right shifts. [16:51.360 --> 16:56.360] The letter-soup instructions could be used [16:56.360 --> 16:59.360] to replace those. [16:59.360 --> 17:01.360] Intel provides an official manual for all of this. [17:01.360 --> 17:04.360] It's very verbose, which is great in many cases [17:04.360 --> 17:06.360] because it provides really precise detail [17:06.360 --> 17:08.360] of how the instructions work, [17:08.360 --> 17:10.360] but unfortunately it is not at all approachable. [17:10.360 --> 17:13.360] There are a few websites that try and simplify things.
[17:13.360 --> 17:15.360] I think this website, officedaytime.com, [17:15.360 --> 17:17.360] is some kind of Japanese website, [17:17.360 --> 17:19.360] in English, that [17:19.360 --> 17:22.360] tries to group all the instructions [17:22.360 --> 17:25.360] in some kind of logical ordering, [17:25.360 --> 17:28.360] and that makes it a lot simpler to understand. [17:28.360 --> 17:30.360] Any questions? [17:30.360 --> 17:32.360] Hopefully I'll be able to answer them, [17:32.360 --> 17:34.360] but thankfully at FOSDEM there's always somebody [17:34.360 --> 17:36.360] with more knowledge than you in the room. [17:36.360 --> 17:40.360] I can't see where they are, but I did see them at one point. [17:40.360 --> 17:42.360] Thanks. [17:42.360 --> 17:48.360] Thank you. [17:48.360 --> 17:51.360] Any questions in the room? [17:51.360 --> 17:55.360] Regarding the direct assembly writing of AVX-512, [17:55.360 --> 17:59.360] there are about 7,000 instructions in AVX-512. [17:59.360 --> 18:01.360] Why? [18:01.360 --> 18:03.360] If you choose direct assembly, [18:03.360 --> 18:05.360] then you essentially might miss out [18:05.360 --> 18:07.360] on potential instruction scheduling [18:07.360 --> 18:10.360] between different architectures. [18:10.360 --> 18:12.360] Compilers might schedule better [18:12.360 --> 18:15.360] if you want to get a performance benefit in the future. [18:15.360 --> 18:21.360] But then you have to ship a binary for each version. [18:21.360 --> 18:23.360] Sorry, repeat the question. [18:23.360 --> 18:26.360] You have to write in intrinsics, that's what I'm saying. [18:26.360 --> 18:28.360] In order to compile... [18:28.360 --> 18:31.360] The question is the classic question, [18:31.360 --> 18:34.360] the "can the compiler do a better job than a human" question. [18:34.360 --> 18:38.360] In dav1d, certainly the register allocation [18:38.360 --> 18:40.360] has not been very good in compilers historically.
[18:40.360 --> 18:44.360] dav1d has shown this quite dramatically [18:44.360 --> 18:47.360] because it has its own custom ABI internally, [18:47.360 --> 18:49.360] and you wouldn't be able to do that with the compiler, [18:49.360 --> 18:52.360] like come up with your own internal ABI between functions. [18:52.360 --> 18:56.360] So there's certainly 10% plus, on the individual function, [18:56.360 --> 18:59.360] in speed gains versus doing it in intrinsics. [18:59.360 --> 19:02.360] Some bits of some instructions are not available in intrinsics, [19:02.360 --> 19:04.360] like always. [19:04.360 --> 19:07.360] It's a compromise. [19:07.360 --> 19:10.360] Overall, it's been the way in FFmpeg and x264 [19:10.360 --> 19:13.360] for the last 10 years, and I think all intrinsics [19:13.360 --> 19:15.360] and inline assembly are banned, [19:15.360 --> 19:17.360] and there are only one or two bits left, [19:17.360 --> 19:21.360] and there's a very good reason why they need to be there. [19:21.360 --> 19:24.360] I have mixed experience about this. [19:24.360 --> 19:26.360] I agree on the... [19:26.360 --> 19:28.360] Ideally, assembly is better, [19:28.360 --> 19:30.360] but we had some code in intrinsics, [19:30.360 --> 19:33.360] we compiled it with the latest Clang, 15, [19:33.360 --> 19:36.360] and we saw a 15 to 20% speed increase. [19:36.360 --> 19:40.360] But did you try writing it to begin with in... [19:40.360 --> 19:42.360] Yes, it was in intrinsics. [19:42.360 --> 19:44.360] Write it in... [19:44.360 --> 19:47.360] Write it originally in assembly and compare, [19:47.360 --> 19:49.360] but it's... [19:49.360 --> 19:51.360] So for example, some of this... [19:51.360 --> 19:53.360] Sorry, you've gone to... [19:53.360 --> 19:56.360] Some of the bit-twiddling in there, [19:56.360 --> 20:00.360] for example, a compiler would never really have the understanding to do...
[20:00.360 --> 20:02.360] In fact, I did try ChatGPT, [20:02.360 --> 20:05.360] and ChatGPT at least sort of understood a few of the concepts. [20:05.360 --> 20:08.360] It's interesting because, not quite out of a day job, [20:08.360 --> 20:11.360] but I did ask ChatGPT to write this function, actually, [20:11.360 --> 20:13.360] just sort of to see what... [20:13.360 --> 20:15.360] And it did have some vague idea what was going on. [20:15.360 --> 20:18.360] It didn't need to sort of be helped, which is quite interesting. [20:18.360 --> 20:20.360] Yep. [20:20.360 --> 20:23.360] Is there any collaboration between the multimedia people, [20:23.360 --> 20:26.360] the people who write the codecs, [20:26.360 --> 20:29.360] and the guys writing the compiler, to tell them, [20:29.360 --> 20:33.360] look, perhaps you could target certain patterns? [20:33.360 --> 20:36.360] The question is, is there a collaboration between people writing the compilers [20:36.360 --> 20:38.360] and the multimedia community? [20:38.360 --> 20:41.360] Yes, on ARM in particular, I think, [20:41.360 --> 20:43.360] is Martin here? [20:43.360 --> 20:46.360] No, Martin is not here, but Martin spends a lot of time [20:46.360 --> 20:49.360] talking to the compiler community and the linker community, [20:49.360 --> 20:54.360] mostly on miscompilations, that's more his thing. [20:54.360 --> 20:56.360] And I think, yeah, [20:56.360 --> 20:59.360] I think there is also some sharing, mostly around the C code, [20:59.360 --> 21:02.360] if the C code is badly miscompiled [21:02.360 --> 21:07.360] or the compiler takes the wrong approach, [21:07.360 --> 21:09.360] because you can see, actually, [21:09.360 --> 21:12.360] some versions of the compiler will really do a bad job [21:12.360 --> 21:15.360] on the C and the assembly can be 40 times faster, [21:15.360 --> 21:17.360] and that's...
[21:17.360 --> 21:19.360] I don't know if that's something you can really trust, [21:19.360 --> 21:21.360] if one day you change compiler version [21:21.360 --> 21:26.360] and a function that you thought was immeasurable [21:26.360 --> 21:30.360] is now 40 times slower than it was. [21:30.360 --> 21:32.360] And then the question from the internet is, [21:32.360 --> 21:34.360] did you have the occasion to look at [21:34.360 --> 21:36.360] RVV or SVE vector instructions for FFmpeg? [21:36.360 --> 21:38.360] Wow, that's a surprise for this person, [21:38.360 --> 21:42.360] because the next speaker is going to be talking about this entire topic. [21:42.360 --> 21:44.360] Where is the next speaker? [21:44.360 --> 21:46.360] He's over there, and the next speaker here, Rémi, [21:46.360 --> 21:49.360] will be talking about this entire topic. [21:49.360 --> 21:51.360] Another question? [21:51.360 --> 21:53.360] Yeah, I was wondering. [21:53.360 --> 21:58.360] So, obviously, the runtime CPU capability detection [21:58.360 --> 22:01.360] and dispatching of the right functions is desirable, [22:01.360 --> 22:04.360] but I don't think it's necessarily contradictory [22:04.360 --> 22:07.360] to having some amount of abstraction. [22:07.360 --> 22:13.360] Like, have you, for instance, looked into the Highway library [22:13.360 --> 22:16.360] that is being used in some places, [22:16.360 --> 22:19.360] that is trying to provide some kind of abstraction [22:19.360 --> 22:25.360] while still allowing you to do runtime dispatch? [22:25.360 --> 22:29.360] So, the question was, have you looked into some of the abstraction libraries [22:29.360 --> 22:33.360] like Highway that are trying to do a sort of compromise [22:33.360 --> 22:36.360] between runtime dispatch and abstraction? [22:36.360 --> 22:38.360] I think this question was already answered, [22:38.360 --> 22:40.360] I think, two presentations ago.
[22:40.360 --> 22:43.360] Not with Highway, but I think with a different SIMD library, [22:43.360 --> 22:45.360] but there have been various approaches, [22:45.360 --> 22:47.360] liboil, is it SIMDe? [22:47.360 --> 22:49.360] Various different approaches. [22:49.360 --> 22:53.360] And again, the result in FFmpeg and x264 [22:53.360 --> 22:56.360] has been that it's written by hand. [22:56.360 --> 22:59.360] It's written once, and you know almost certainly [22:59.360 --> 23:02.360] that it's going to be usable for a long time. [23:02.360 --> 23:04.360] I didn't really talk about it, but the abstraction, [23:04.360 --> 23:08.360] there is a lightweight abstraction layer in x264 and FFmpeg [23:08.360 --> 23:11.360] to try and basically handle 32-bit, 64-bit, [23:11.360 --> 23:15.360] and to handle other things like the different ABIs. [23:15.360 --> 23:19.360] The abstraction layer kind of handles [23:19.360 --> 23:22.360] some of the future-proofing in that respect, [23:22.360 --> 23:25.360] but there's a blog post online from Ronald, [23:25.360 --> 23:27.360] if he's here, but he's not here. [23:27.360 --> 23:29.360] He explains some of this. [23:29.360 --> 23:32.360] It's another presentation in itself, unfortunately. [23:32.360 --> 23:38.360] For your benchmark, do you know which optimizations [23:38.360 --> 23:41.360] the C code was compiled with? [23:41.360 --> 23:43.360] The question was, for the benchmark, [23:43.360 --> 23:47.360] what optimizations was the C code compiled with? [23:47.360 --> 23:54.360] GCC -O3, varying versions of GCC. [23:54.360 --> 23:56.360] In the FFmpeg test suite, there's all sorts. [23:56.360 --> 24:00.360] I think for GCC, there's a whole range, [24:00.360 --> 24:05.360] depending on the build OS, but from 4 to 12, I think, [24:05.360 --> 24:07.360] and maybe some people test nightly. [24:07.360 --> 24:09.360] I think Martin certainly tests nightly for ARM.
[24:09.360 --> 24:11.360] I don't know if anyone tests nightly on x86. [24:11.360 --> 24:13.360] Some are LLVM as well. [24:13.360 --> 24:16.360] But again, I would be very surprised [24:16.360 --> 24:19.360] if a compiler would be able to come up with something [24:19.360 --> 24:21.360] like what a human wrote, [24:21.360 --> 24:26.360] because this involves bit properties of the actual packing, [24:26.360 --> 24:31.360] and actually the trick with pmaddubsw is a kind of trick [24:31.360 --> 24:35.360] to try and do a multiply and a zeroing at the same time, [24:35.360 --> 24:38.360] and the compiler probably doesn't have the level of thinking [24:38.360 --> 24:41.360] to understand the bit patterns internally. [24:41.360 --> 24:43.360] Something like ChatGPT might one day, [24:43.360 --> 24:46.360] which would be quite interesting, but I don't think the compiler does. [24:46.360 --> 24:48.360] The last question. [24:48.360 --> 24:51.360] I'm just going to follow up on what you said. [24:51.360 --> 24:54.360] If you have a small algorithm, a small function, like 10, [24:54.360 --> 24:56.360] 100 lines, maybe, [24:56.360 --> 24:58.360] writing it in assembly might be easy, [24:58.360 --> 25:00.360] but if you have a huge function, [25:00.360 --> 25:04.360] like a filter, a variance filter, or something, a DCT, [25:04.360 --> 25:07.360] writing it directly in assembly might take a long time. [25:07.360 --> 25:09.360] That's why originally we write it in C, [25:09.360 --> 25:13.360] and then we try to write it in intrinsics. [25:13.360 --> 25:15.360] So the question is, [25:15.360 --> 25:21.360] a longer function might take a longer time to write in assembly [25:21.360 --> 25:24.360] compared to C or intrinsics. [25:24.360 --> 25:28.360] Yes, but there are DCTs in FFmpeg, [25:28.360 --> 25:30.360] but they're macroed, right? [25:30.360 --> 25:33.360] Steps have macros to try and help with that.
[25:33.360 --> 25:35.360] Again, the abstraction layer also adds, I think, macros [25:35.360 --> 25:38.360] on top of what the normal assembler does in terms of macros, [25:38.360 --> 25:40.360] so the blog post explains it, [25:40.360 --> 25:42.360] but SWAP is kind of interesting. [25:42.360 --> 25:44.360] It lets you swap registers, [25:44.360 --> 25:46.360] but then continue with them, [25:46.360 --> 25:48.360] and the layer just handles all of that internally. [25:48.360 --> 25:51.360] There are also just macros for, like, clipping. [25:51.360 --> 25:53.360] I think it was in the example, [25:53.360 --> 25:56.360] but clip is an example. [25:56.360 --> 25:58.360] So CLIPUB is a macro, [25:58.360 --> 26:00.360] and on the right target set, [26:00.360 --> 26:02.360] it will go and use the right clipping instructions [26:02.360 --> 26:04.360] if they're available, for example, [26:04.360 --> 26:06.360] and there's a bunch of these, I think, [26:06.360 --> 26:09.360] that's how to fly. There are a few others like that. [26:09.360 --> 26:38.360] Thank you, Kieran.