[00:00.000 --> 00:38.000] We continue with our next speaker, with kind of a follow-up to the previous talk, because it's approximately the same topic, but this time about the RISC-V V extension and ARM. So please welcome Rémi.

[00:38.000 --> 01:08.000] Hi, good afternoon, everyone. I hope you are done with the digestion. So, yeah, this pretty much follows up and complements Kieran's previous talk. But before I go into the details: obviously, I work for a big company, so I have to put up this disclaimer. And if I speak too fast or don't articulate properly, please stop me.

[01:08.000 --> 01:20.000] With that said, who am I? I don't think it matters much, but this is my 16th time at FOSDEM and only my first presentation, so bear with me. Having said that, I don't work in this field at all; this is just a fancy thing for me to do.

[01:20.000 --> 01:50.000] So, some history. Has anybody ever seen this outside a computer museum? Right, that's the Cray-1. It's indeed the first vector processor, from the second half of the 70s; I wasn't even born back then. The point being, it's the first vector processor, and now, after almost 50 years, we may finally be coming back to this approach to calculations in computers.

[01:50.000 --> 02:10.000] But for people of my generation, this is more what we associate with SIMD: multimedia. This is POD, the first video game that actually used MMX, MMX being the first SIMD extension in the consumer space, let's say.

[02:10.000 --> 02:54.000] As I said, MMX came in 1997, and that was 64-bit vectors, so you could compute over 64 bits at a time. Mind you, back then computers were pretty much only 32-bit. Two years later came SSE, and many, many versions of SSE; SSE2, which is more popular for multimedia use cases, came in 2000. I'm not going to go through all the details of SSE, because there are a million versions. AVX came in 2008. AVX2, which Kieran mentioned, came in 2011; that was the first to have 256-bit vectors. Then AVX-512, which was the topic of the previous presentation, officially came in 2013.
[02:54.000 --> 03:02.000] But as Kieran mentioned, the first real, proper CPUs for it only came out in 2017.

[03:02.000 --> 03:49.000] On the ARM side, the first SIMD was actually 32-bit only, on ARMv6 in 2002. That doesn't really seem to make sense, but that's because it basically calculates 4 times 8 bits or 2 times 16 bits at a time. There was no 64-bit SIMD on ARM; 128-bit came with ARMv7, so the Cortex-A8, usually called NEON, officially called Advanced SIMD, in 2005. On ARMv8 it's pretty much the same. Now, it's not actually compatible between the 32-bit and 64-bit instruction sets, but it came with ARMv8 in 2012; it's also officially called Advanced SIMD, and it's also colloquially known as NEON.

[03:49.000 --> 03:54.000] As for RISC-V: RISC-V is much more recent, and there is no SIMD.

[03:54.000 --> 04:11.000] The problem (and this is only a short summary; there are way more extensions, especially on the x86 side) is that every damn time you have to rewrite your assembly, and as the questions and answers in the previous talks covered, this is kind of time-consuming.

[04:11.000 --> 04:24.000] So, with that said, this was all fixed-size SIMD. What about variable-length SIMD, which is what we will be talking about today? How would you go about doing it?

[04:24.000 --> 04:58.000] Well, the simple way is to just ask the CPU what its vector size is, and on RISC-V, this is how you do it: a control-and-status-register read instruction. The register is called vlenb, for vector length in bytes, and it will store the number of bytes in a vector into, in this case, t0; with that you can then iterate. If you want to know the number of elements, you have to do a right shift: for 32-bit elements, you divide by 4, so you shift right by 2 bits.
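For illustration, here is roughly what that query looks like from C, as a minimal sketch assuming a RISC-V target and GCC or Clang inline assembly (the helper names are mine):

```c
#include <stddef.h>

/* Read the vlenb CSR: the vector register length in bytes. */
static inline size_t rvv_vlenb(void)
{
    size_t vlenb;
    __asm__("csrr %0, vlenb" : "=r"(vlenb));
    return vlenb;
}

/* Elements per vector for 32-bit data: divide by 4, i.e. shift right by 2. */
static inline size_t rvv_elems_e32(void)
{
    return rvv_vlenb() >> 2;
}
```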
[04:58.000 --> 05:43.000] You could do it like that: you would take your C loop, convert it into assembler that operates on however many elements at a time, and then, if you have space in your vector register bank, you would probably unroll to try to hide the latency a little bit, because if you operate on only one dataset, the inter-instruction latencies between dependent instructions are going to hurt your performance. So in multimedia you typically unroll twice, working over two sets of vectors in parallel. And when you have done all of that, you will be working on however many, say 32, elements at a time, so you also have to deal with edges, because you might not have a multiple of 32 elements.

[05:43.000 --> 06:09.000] And that's fine; that's one way to do it. In fact, last I checked, that's how Clang/LLVM does auto-vectorization on RISC-V if you enable it: it literally starts by reading the vector length, unrolls twice, and then deals with the edges; that is, if you have enabled vectorization and the RISC-V vector extension. But that's not really how you want to do it.

[06:09.000 --> 06:40.000] Before we go on to how to actually do it: what vector lengths are we dealing with here? As mentioned earlier, common values are 128, 256, and 512 bits. Both ARM and RISC-V guarantee that even with a variable vector length, it's going to be at least 128 bits, and it's also going to be a power of two bits, which is kind of convenient for the calculations.

[06:40.000 --> 07:11.000] As far as I've seen, there are announcements of real hardware with 256 and 512 bits, which you should be able to buy at some point in the near future. More crazy stuff: I've seen designs announced with 1024 bits. I don't know if they're going to store all those bits in the physical register bank, but it would be interesting if it happens.
[07:11.000 --> 07:42.000] And I have seen theoretical designs at 4096 bits; theoretical in this case meaning that there are actual schematics of how you could build the chip, and even simulations of the performance the chip would get on certain algorithms. As to whether it's practically implementable in an existing industrial process, I don't know. I'm not a specialist in electronics, but that sounds a little bit questionable.

[07:42.000 --> 07:54.000] In theory, at the syntactic level, that is, the instruction encoding level, you can have up to 2^16 bits, at least on RISC-V. I'm not sure about that, actually.

[07:54.000 --> 08:16.000] So, how are you actually supposed to do variable-vector-length SIMD, or vector processing as it's often called (practically speaking, vector and SIMD are synonyms)? Well, first you have to use predication, which is highly prevalent in variable-vector-length scenarios.

[08:16.000 --> 08:42.000] Now, it's not a completely new concept. Kieran mentioned the k-masks in AVX, so AVX also has predication. But with variable vector lengths it's really essential, because the programming model, the loops, is essentially built on predication. And that's true both for ARM and RISC-V.

[08:42.000 --> 08:56.000] A predicate is a vector of booleans. Like the k-mask on x86, it's called a P register in ARM, and it's the mask vector in RISC-V.

[08:56.000 --> 09:30.000] And, as Kieran said (I'm kind of repeating), it specifies which of the elements in your vector a given instruction will load, modify, or store. If it's a load instruction: which values you load from memory and overwrite into the register. If it's a store instruction, it's the other way around: which values in memory you are going to overwrite versus which ones are left as they are. And if it's a calculation instruction, vector to vector, it affects which results are actually stored into the register versus which ones are just discarded.
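To make the stored-versus-discarded distinction concrete, here is a minimal sketch of my own (not from the talk) using the ARM SVE intrinsics from arm_sve.h, assuming a compiler targeting SVE. The predicate pg selects the active lanes of a 32-bit add:

```c
#include <arm_sve.h>

/* Merging predication: inactive lanes keep a's old value. */
svint32_t add_merge(svbool_t pg, svint32_t a, svint32_t b)
{
    return svadd_s32_m(pg, a, b);
}

/* Zeroing predication: inactive lanes are set to zero. */
svint32_t add_zero(svbool_t pg, svint32_t a, svint32_t b)
{
    return svadd_s32_z(pg, a, b);
}
```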
[09:30.000 --> 10:10.000] So, on ARMv9, or SVE, the way you would now typically write your SVE loop, instead of, say, your NEON loop, is to start by initializing a counter, say X10, to zero. Then you have this funny instruction called WHILELT or WHILELO, and with it you initialize P0, which is one of the predicate registers, to express, essentially, how many elements you still have left in your input data.

[10:10.000 --> 10:39.000] Here, imagine that X0 is the number of elements given to this function, and X10 is the count of how far we have gone, so it's our iterator. Essentially, what we say is: as long as the number of remaining elements is at least as large as the vector that the CPU can handle, we set the predicate to all true, so we use the full size of the vector.

[10:39.000 --> 10:53.000] And once the number of remaining elements is more than zero but strictly less than the vector size the CPU can handle, we start ignoring the values at the end of the vector: the predicate will have a bunch of ones, and then, at the end, a bunch of zeros.

[10:53.000 --> 11:14.000] And this is how you abstract away and hide the complexity of dealing with the edge in your loop. Essentially, by doing this, you never need to know how many elements you are dealing with in any iteration of your loop, because it's all hidden away: the size of the vector and the size of the predicate are matched, so you don't actually care.
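Here is what that loop pattern might look like with the SVE C intrinsics instead of raw assembly; a sketch under my own naming, where svwhilelt_b32_s64 is the intrinsic form of WHILELT and svcntw() gives the number of 32-bit lanes per vector:

```c
#include <arm_sve.h>
#include <stdint.h>

/* c[i] = a[i] + b[i]: the predicated strip-mine loop described above.
 * No separate edge handling, and the vector length never appears. */
void add_s32(int32_t *c, const int32_t *a, const int32_t *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);         /* ones while i + lane < n */
        svint32_t va = svld1_s32(pg, a + i);           /* inactive lanes are zeroed */
        svint32_t vb = svld1_s32(pg, b + i);
        svst1_s32(pg, c + i, svadd_s32_x(pg, va, vb)); /* inactive lanes not stored */
    }
}
```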
[11:14.000 --> 11:36.000] And you also don't need to deal with edges, because even if there are one, two, three, or four elements left over at the end, you can just deal with them in the last iteration. That will, of course, be a little bit less efficient than using the full size of the vector, but it's still much faster than having a separate edge case, if only because you will not be stressing the instruction cache of the CPU.

[11:36.000 --> 12:12.000] So that's predication. Now, unrolling. If you think about it, all that I just said about predication doesn't really work with unrolling, because you have set your predicate vector from the count of how many elements are left in your total element count. You can't unroll: you have said, oh, I have 10 elements left, I am going to use 10 elements in my vector. It just doesn't work, because if you had, say, one and a half vectors left, you would want one predicate with all the bits set and another predicate with half the bits set. That doesn't really work very well.

[12:12.000 --> 12:24.000] And, yes, this is a bit of a hot take, and maybe I will never again be allowed to write SIMD code after this, but: just don't unroll if you use variable vector lengths.

[12:24.000 --> 12:59.000] Now, there may be cases where you can unroll because your algorithm naturally has some parallel aspect in its design, but the idea of vector processing is that we have higher latency and larger vectors, which in the end result in higher throughput, and we shouldn't need to unroll. I'm sure you will find actual designs, real hardware, real processors, where it will be faster if you do unroll, and how much you need to unroll will depend on the design.

[12:59.000 --> 13:22.000] And, of course, if you are trying to squeeze the very last MIPS out of one specific piece of hardware, then maybe you should unroll. But generally speaking, at least, you shouldn't start by unrolling. Another interesting thing to keep in mind, which was kind of already mentioned on the previous slide, is that you don't have alignment issues.
[13:22.000 --> 13:46.000] One common problem with SIMD instruction sets is that the load and store instructions require over-aligned data, typically aligned on the size of the vector, which is very inconvenient when you're operating from C or C++ code, because the C or C++ allocator will only align on whatever the ABI specifies; on AArch64 that would be 16 bytes for the stack and 8 bytes for the heap.

[13:46.000 --> 14:29.000] With both SVE and RISC-V vectors, the alignment needed is usually only the alignment of the element, not the size of the vector. So, if you are operating on, say, 4-byte data elements, then your vectors only need to be aligned on 4 bytes, which is a very nice property, especially for the edge cases. You also don't have to deal with the situation where one input is perfectly aligned and the output is not: you end up with this weird mismatch and have to handle different edge cases; it's really a mess. With vector processing you don't do that, so you don't actually have to worry about it.

[14:29.000 --> 14:45.000] So, with that, we have covered the generalities. Now, how is it looking on the ARM side? Then we will look at the RISC-V side. I thought about putting everything together, but then it becomes a huge mess, so it's going to be a bit repetitive instead, because, of course, there are a lot of similarities.

[14:45.000 --> 15:16.000] SVE came out about five years ago, a little more than five years ago; I think it was announced in late 2016, if I recall correctly. It was pretty much not aimed at multimedia. It was explicitly meant for other things: scientific applications, engineering modeling, that kind of stuff; HPC, in short. And so nobody used it. At least nobody in this room used it.

[15:16.000 --> 15:40.000] This was fixed with SVE2, which is sometimes called ARMv9 because it kind of comes with ARMv9, but it's really called SVE2. It fixed that issue, with the realization that this programming model is actually also interesting for multimedia and crypto, which were missing from SVE1.
[15:40.000 --> 16:11.000] And so what they did is they just looked at which mnemonics were missing and added those. It's pretty much the same mnemonics; you just add the predicate register. That is, of course, a little bit more complicated, but as I mentioned, you just use a WHILE instruction, which will provision your predicate, and you have to pick the element size so that it all adds up correctly. And then you have a new set of branch conditions: first element, last element, and so on and so forth.

[16:11.000 --> 16:26.000] So the remaining elements are determined by the predicate register, the predicate register sets the condition flags, and there are also instructions to count the number of processed elements and subtract it from your counter register.

[16:26.000 --> 16:49.000] And yeah, at this point: how do you detect this stuff? There's a preprocessor macro; otherwise, as usual on ARMv8, you have a bunch of privileged registers for the OS to look at, and then on Linux you have a bunch of flags in the auxiliary vector's hardware-capability bits. So it's all classic. On another OS, you're out of luck.

[16:49.000 --> 17:22.000] Availability: as we said, 2016, but that didn't really work for us. SVE2 was specified in 2019, but the real hardware only came early last year: the Cortex-X2 and the other cores for the DynamIQ DSU-110. So the Samsung Exynos 2200 and so on do have SVE2. Unfortunately, it's only 128-bit vectors, and it's pretty damn expensive, but if you want to, you can find the hardware.

[17:22.000 --> 17:35.000] So, RISC-V: it's a different model. (From the audience: Can I add? There's also the Alibaba one, the Yitian.) Yeah, maybe, it's possible, yes. It's only available in China, but it's available.

[17:35.000 --> 17:56.000] So, on RISC-V, the predication is a little bit different: there is a separation between the element count and the actual predicate. In practice, in multimedia (maybe not in dav1d), you usually don't use the predicate at all; instead you just count the elements. This is the instruction you always find at the beginning of the loop, which configures the vector unit.
[17:56.000 --> 18:15.000] In this case, what we say is that we have a certain number of input elements, and we want to get back the number of elements the CPU will deal with in this iteration. We then have to give the size of the element in bits, in this case, for instance, 16 bits.

[18:15.000 --> 18:26.000] Then the group size, which is kind of free unrolling: if you set it to 2 and you say you want to use vector register 8, it will automatically use vector 8 and vector 9 at the same time, for instance.

[18:26.000 --> 18:48.000] And tail mode, which we always set to agnostic because we don't really care about the tail, and mask mode, which we also always set to agnostic. There might be use cases where you need to do something else, which might be a little bit slower, but usually you don't. These control what happens to the elements that are masked off, or that sit at the end of the vector beyond the ones you asked for. Usually you don't care about them, so you just tell the CPU that you don't care.

[18:48.000 --> 19:09.000] One cool thing about RISC-V: the floating-point registers are separate from the vector registers, unlike on ARM, so you have more registers available if you have hybrid calculations between the scalar and the vector side. But do mind the floating-point calling convention when this happens, otherwise you will screw up your register state and confuse your CPU.

[19:09.000 --> 19:22.000] Another interesting thing about RISC-V: they have segmented loads and stores, which are similar to the structured loads and stores on ARM, but they can do up to 8 structures, whereas ARM only goes up to 4.

[19:22.000 --> 19:44.000] What is perhaps much more interesting is strided loads and stores, where you can say: I have this register X which contains a value, and that's going to be my stride. So, for instance, you can put the width of your video into one register and load all the pixels of a column in one instruction, without having to do weird shuffling and whatever.

[19:44.000 --> 20:00.000] Does that actually perform in practice? I think that's going to depend on the design, but normally it should hit the data cache, which should be okay. I'll come back to that.
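As a sketch of the count-based loop just described, and of a strided column load, here is a version with the RVV 1.0 C intrinsics from riscv_vector.h; it assumes a toolchain that uses the __riscv_-prefixed intrinsic names, and the function names are mine:

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + b[i] over 16-bit elements. vsetvl returns how many elements
 * this iteration handles; e16 selects 16-bit elements, m2 a group of two
 * registers (the "free unrolling" group multiplier mentioned above). */
void add_s16(int16_t *c, const int16_t *a, const int16_t *b, size_t n)
{
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e16m2(n);
        vint16m2_t va = __riscv_vle16_v_i16m2(a, vl);  /* unit-stride load */
        vint16m2_t vb = __riscv_vle16_v_i16m2(b, vl);
        __riscv_vse16_v_i16m2(c, __riscv_vadd_vv_i16m2(va, vb, vl), vl);
        a += vl; b += vl; c += vl; n -= vl;
    }
}

/* Strided load: one 8-bit pixel per image row, i.e. a column, with the
 * line width (in bytes) passed as the stride. */
vuint8m1_t load_column(const uint8_t *top, ptrdiff_t stride, size_t vl)
{
    return __riscv_vlse8_v_u8m1(top, stride, vl);
}
```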
[20:00.000 --> 20:16.000] On the downside, you don't have transpose or zip instructions, which can be annoying, so you have to replace them with strides. That's fine if you want to take every second element from one vector, and so on.

[20:16.000 --> 20:27.000] Feature detection: they have very, very detailed preprocessor feature flags. You can download the slides if you're interested.

[20:27.000 --> 20:53.000] On the other hand, runtime detection is pretty crappy. You have to trust the device tree node, so you have to trust the bootloader to describe it correctly to your OS in the device tree data structure. Otherwise, there is a flag: the V bit, bit 21, because V is the 22nd letter of the alphabet, in the auxiliary vector for hardware capabilities on Linux.

[20:53.000 --> 21:24.000] Availability: unfortunately, at this time there is no hardware. Alibaba, sorry, T-Head, has made hardware available, but it implements version 0.7.1, from about 18 months before the standardized specification, which is what Clang and GCC implement. So you can kind of work with that, and it gives you some idea of the performance, but you are going to have to rewrite stuff, because it's not completely bit-compatible, so it's kind of annoying.

[21:24.000 --> 21:35.000] I don't know when this stuff is going to happen. I'm pretty sure it is going to happen, and I would guess that by the end of this year we are going to see hardware available.

[21:35.000 --> 21:55.000] Also, kind of dodging the earlier question: because we have so many different vendors on RISC-V (I only listed three here, but I think there are others), there might be big differences between the performance characteristics of the different vendors. These are our references.

[22:01.000 --> 22:26.000] Yes, I have just a few questions. Have you heard of the SVP64 project from Libre-SOC, which is a kind of similar vector approach for PowerPC? No, I haven't looked at PowerPC at all. Another question, from my own SIMD programming work: we often have applications that are inherently horizontal.
[22:26.000 --> 22:46.000] For example, say you are writing a vectorized string search, or you are decoding JPEGs, where you have these 8x8 blocks on which you want to do some sort of cosine transform. They have a fixed size, and depending on the vector size, you want to break them up, or you may have to process multiple of them at the same time. Is there an intelligent way to solve this?

[22:46.000 --> 23:18.000] I've had this case. So the question is, paraphrasing: when you have a naturally fixed-size input block that you want to process at a time, how do you do this? Because then you actually want a fixed-size vector, in effect. I've had this case a couple of times. One way is to just check that the vector size of the CPU is big enough and do one block at a time. If you can, try to do several at a time: the vector size is always going to be a power of two, so you should be able to parallelize relatively easily. Obviously, the ideal situation is to parallelize.

[23:18.000 --> 23:32.000] Where you will have a problem is if your dataset is larger than the vector; then it's going to become complicated for you. On RISC-V you can deal with this with the group multiplier, which will allow you to use multiple vector registers as a single vector.

[23:32.000 --> 23:46.000] And the last question I have is: how do you realistically test vectorized routines when the hardware you have only supports one vector length at most? Do you have to use some sort of emulation to set this up?

[23:46.000 --> 24:04.000] So the question is how you test different vector sizes, for validation, I guess. Most of the loops don't really care about the vector size, because if you have a simple case that follows the simple pattern, it doesn't matter what the vector size is; except for benchmarking, of course, and there you have a problem.

[24:04.000 --> 24:28.000] Otherwise, QEMU and Spike, at least for RISC-V, support any vector size you give them, as long as it's a valid one from the specification's point of view. Do you realistically test for that, or do you just say it's simply not going to be a problem? I mean, personally, when I've had the situation where I had a fixed-size input, I did test with different vector sizes; in most cases you don't actually care.
[24:28.000 --> 24:36.000] I mean, then it's a matter of choice, of how you do your testing and how strict you want to be with the validation, I think. Thank you.

[24:36.000 --> 24:38.000] Do we have one more question?

[24:38.000 --> 25:31.000] First, a disclaimer: I'm involved with the Libre-SOC project and SVP64. It's similar to RISC-V vectors, not exactly the same, but they share a lot of the common ideas. You mentioned a very good point, that SIMD is not vector processing. Where I work, we had to try to port some code from NEON to SVE2, and it was less than suboptimal, let's say. We had to revert back to the original C algorithm.