[00:00.000 --> 00:38.000] We continue with our next speaker, with kind of a follow-up to the previous talk, because it's approximately the same topic, but this time about the RISC-V V extension and ARM. So please welcome Rémi.

[00:38.000 --> 01:08.000] Hi, good afternoon, everyone. I hope you are done with the digestion. So, yeah, this pretty much follows up and complements Kieran's previous talk. But before I go into the details: obviously, I work for a big company, so I have to put up this disclaimer. And if I speak too fast or don't articulate properly, please stop me.

[01:08.000 --> 01:20.000] With that said, who am I? I don't think it matters much, but this is my 16th time at FOSDEM and only my first presentation, so bear with me. Having said that, I don't work in this field at all; this is just a fancy thing for me to do.

[01:20.000 --> 01:50.000] So, some history. Has anybody ever seen this outside a computer museum? Right, that's the Cray-1. It's indeed the first vector processor, from the second half of the 70s; I wasn't even born back then. The point being, it's the first vector processor, and now, after almost 50 years, we may finally be coming back to this approach to calculations in computers.

[01:50.000 --> 02:10.000] But for people of my generation, this is more what we associate with SIMD: multimedia. This is POD, the first video game that actually used MMX, MMX being the first SIMD extension in the consumer space, let's say.

[02:10.000 --> 02:54.000] As I said, MMX came in 1997, and that was 64-bit vectors, so you could compute over 64 bits at a time. Mind you, back then computers were pretty much only 32-bit. Two years later came SSE, and many, many versions of SSE; SSE2, which is more popular for multimedia use cases, came in 2000. I'm not going to go through all the details of SSE, because there are a million versions. AVX came in 2008. AVX2, which Kieran mentioned, came in 2011; that was the first to have 256-bit vectors. Then AVX-512, which was the topic of the previous presentation, officially came in 2013.
[02:54.000 --> 03:02.000] But as Kieran mentioned, the first real, proper CPUs for it only came out in 2017.

[03:02.000 --> 03:49.000] On the ARM side, the first SIMD was actually 32-bit only, on ARMv6 in 2002. That doesn't really seem to make sense, but that's because it basically calculates 4 times 8 bits or 2 times 16 bits at a time. There was no 64-bit SIMD on ARM; 128-bit came with ARMv7, so the Cortex-A8, usually called NEON, officially called Advanced SIMD, in 2005. On ARMv8 it's pretty much the same. Now, it's not actually compatible between the 32-bit and 64-bit instruction sets, but it came with ARMv8 in 2012; it's also officially called Advanced SIMD, and it's also colloquially known as NEON.

[03:49.000 --> 03:54.000] As for RISC-V: RISC-V is much more recent, and there is no SIMD.

[03:54.000 --> 04:11.000] The problem (and this is only a short summary; there are way more extensions, especially on the x86 side) is that every damn time you have to rewrite your assembly, and as the questions and answers in the previous talks covered, this is kind of time-consuming.

[04:11.000 --> 04:24.000] So, with that said, this was all fixed-size SIMD. What about variable-length SIMD, which is what we will be talking about today? How would you go about doing it?

[04:24.000 --> 04:58.000] Well, the simple way is to just ask the CPU what its vector size is, and on RISC-V, this is how you do it: a control-and-status-register read instruction. The register is called vlenb, for vector length in bytes, and it will store the number of bytes in a vector into, in this case, t0; with that you can then iterate. If you want to know the number of elements, you have to do a right shift: for 32-bit elements, you divide by 4, so you shift right by 2 bits.
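For illustration, here is roughly what that query looks like from C, as a minimal sketch assuming a RISC-V target and GCC or Clang inline assembly (the helper names are mine):

```c
#include <stddef.h>

/* Read the vlenb CSR: the vector register length in bytes. */
static inline size_t rvv_vlenb(void)
{
    size_t vlenb;
    __asm__("csrr %0, vlenb" : "=r"(vlenb));
    return vlenb;
}

/* Elements per vector for 32-bit data: divide by 4, i.e. shift right by 2. */
static inline size_t rvv_elems_e32(void)
{
    return rvv_vlenb() >> 2;
}
```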
[04:58.000 --> 05:43.000] You could do it like that: you would take your C loop, convert it into assembler that operates on however many elements at a time, and then, if you have space in your vector register bank, you would probably unroll to try to hide the latency a little bit, because if you operate on only one dataset, the inter-instruction latencies between dependent instructions are going to hurt your performance. So in multimedia you typically unroll twice, working over two sets of vectors in parallel. And when you have done all of that, you will be working on however many, say 32, elements at a time, so you also have to deal with edges, because you might not have a multiple of 32 elements.

[05:43.000 --> 06:09.000] And that's fine; that's one way to do it. In fact, last I checked, that's how Clang/LLVM does auto-vectorization on RISC-V if you enable it: it literally starts by reading the vector length, unrolls twice, and then deals with the edges; that is, if you have enabled vectorization and the RISC-V vector extension. But that's not really how you want to do it.

[06:09.000 --> 06:40.000] Before we go on to how to actually do it: what vector lengths are we dealing with here? As mentioned earlier, common values are 128, 256, and 512 bits. Both ARM and RISC-V guarantee that even with a variable vector length, it's going to be at least 128 bits, and it's also going to be a power of two bits, which is kind of convenient for the calculations.

[06:40.000 --> 07:11.000] As far as I've seen, there are announcements of real hardware with 256 and 512 bits, which you should be able to buy at some point in the near future. More crazy stuff: I've seen designs announced with 1024 bits. I don't know if they're going to store all those bits in the physical register bank, but it would be interesting if it happens.
[07:11.000 --> 07:42.000] And I have seen theoretical designs at 4096 bits; theoretical in this case meaning that there are actual schematics of how you could build the chip, and even simulations of the performance the chip would get on certain algorithms. As to whether it's practically implementable in an existing industrial process, I don't know. I'm not a specialist in electronics, but that sounds a little bit questionable.

[07:42.000 --> 07:54.000] In theory, at the syntactic level, that is, the instruction encoding level, you can have up to 2^16 bits, at least on RISC-V. I'm not sure about that, actually.

[07:54.000 --> 08:16.000] So, how are you actually supposed to do variable-vector-length SIMD, or vector processing as it's often called (practically speaking, vector and SIMD are synonyms)? Well, first you have to use predication, which is highly prevalent in variable-vector-length scenarios.

[08:16.000 --> 08:42.000] Now, it's not a completely new concept. Kieran mentioned the k-masks in AVX, so AVX also has predication. But with variable vector lengths it's really essential, because the programming model, the loops, is essentially built on predication. And that's true both for ARM and RISC-V.

[08:42.000 --> 08:56.000] A predicate is a vector of booleans. Like the k-mask on x86, it's called a P register in ARM, and it's the mask vector in RISC-V.

[08:56.000 --> 09:30.000] And, as Kieran said (I'm kind of repeating), it specifies which of the elements in your vector a given instruction will load, modify, or store. If it's a load instruction: which values you load from memory and overwrite into the register. If it's a store instruction, it's the other way around: which values in memory you are going to overwrite versus which ones are left as they are. And if it's a calculation instruction, vector to vector, it affects which results are actually stored into the register versus which ones are just discarded.
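To make the stored-versus-discarded distinction concrete, here is a minimal sketch of my own (not from the talk) using the ARM SVE intrinsics from arm_sve.h, assuming a compiler targeting SVE. The predicate pg selects the active lanes of a 32-bit add:

```c
#include <arm_sve.h>

/* Merging predication: inactive lanes keep a's old value. */
svint32_t add_merge(svbool_t pg, svint32_t a, svint32_t b)
{
    return svadd_s32_m(pg, a, b);
}

/* Zeroing predication: inactive lanes are set to zero. */
svint32_t add_zero(svbool_t pg, svint32_t a, svint32_t b)
{
    return svadd_s32_z(pg, a, b);
}
```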
[09:30.000 --> 10:10.000] So, on ARMv9, or SVE, the way you would now typically write your SVE loop, instead of, say, your NEON loop, is to start by initializing a counter, say X10, to zero. Then you have this funny instruction called WHILELT or WHILELO, and with it you initialize P0, which is one of the predicate registers, to express, essentially, how many elements you still have left in your input data.

[10:10.000 --> 10:39.000] Here, imagine that X0 is the number of elements given to this function, and X10 is the count of how far we have gone, so it's our iterator. Essentially, what we say is: as long as the number of remaining elements is at least as large as the vector that the CPU can handle, we set the predicate to all true, so we use the full size of the vector.

[10:39.000 --> 10:53.000] And once the number of remaining elements is more than zero but strictly less than the vector size the CPU can handle, we start ignoring the values at the end of the vector: the predicate will have a bunch of ones, and then, at the end, a bunch of zeros.

[10:53.000 --> 11:14.000] And this is how you abstract away and hide the complexity of dealing with the edge in your loop. Essentially, by doing this, you never need to know how many elements you are dealing with in any iteration of your loop, because it's all hidden away: the size of the vector and the size of the predicate are matched, so you don't actually care.
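Here is what that loop pattern might look like with the SVE C intrinsics instead of raw assembly; a sketch under my own naming, where svwhilelt_b32_s64 is the intrinsic form of WHILELT and svcntw() gives the number of 32-bit lanes per vector:

```c
#include <arm_sve.h>
#include <stdint.h>

/* c[i] = a[i] + b[i]: the predicated strip-mine loop described above.
 * No separate edge handling, and the vector length never appears. */
void add_s32(int32_t *c, const int32_t *a, const int32_t *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);         /* ones while i + lane < n */
        svint32_t va = svld1_s32(pg, a + i);           /* inactive lanes are zeroed */
        svint32_t vb = svld1_s32(pg, b + i);
        svst1_s32(pg, c + i, svadd_s32_x(pg, va, vb)); /* inactive lanes not stored */
    }
}
```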
[11:14.000 --> 11:36.000] And you also don't need to deal with edges, because even if there are one, two, three, or four elements left over at the end, you can just deal with them in the last iteration. That will, of course, be a little bit less efficient than using the full size of the vector, but it's still much faster than having a separate edge case, if only because you will not be stressing the instruction cache of the CPU.

[11:36.000 --> 12:12.000] So that's predication. Now, unrolling. If you think about it, all that I just said about predication doesn't really work with unrolling, because you have set your predicate vector from the count of how many elements are left in your total element count. You can't unroll: you have said, oh, I have 10 elements left, I am going to use 10 elements in my vector. It just doesn't work, because if you had, say, one and a half vectors left, you would want one predicate with all the bits set and another predicate with half the bits set. That doesn't really work very well.

[12:12.000 --> 12:24.000] And, yes, this is a bit of a hot take, and maybe I will never again be allowed to write SIMD code after this, but: just don't unroll if you use variable vector lengths.

[12:24.000 --> 12:59.000] Now, there may be cases where you can unroll because your algorithm naturally has some parallel aspect in its design, but the idea of vector processing is that we have higher latency and larger vectors, which in the end result in higher throughput, and we shouldn't need to unroll. I'm sure you will find actual designs, real hardware, real processors, where it will be faster if you do unroll, and how much you need to unroll will depend on the design.

[12:59.000 --> 13:22.000] And, of course, if you are trying to squeeze the very last MIPS out of one specific piece of hardware, then maybe you should unroll. But generally speaking, at least, you shouldn't start by unrolling. Another interesting thing to keep in mind, which was kind of already mentioned on the previous slide, is that you don't have alignment issues.
[13:22.000 --> 13:46.000] One common problem with SIMD instruction sets is that the load and store instructions require over-aligned data, typically aligned on the size of the vector, which is very inconvenient when you're operating from C or C++ code, because the C or C++ allocator will only align on whatever the ABI specifies; on AArch64 that would be 16 bytes for the stack and 8 bytes for the heap.

[13:46.000 --> 14:29.000] With both SVE and RISC-V vectors, the alignment needed is usually only the alignment of the element, not the size of the vector. So, if you are operating on, say, 4-byte data elements, then your vectors only need to be aligned on 4 bytes, which is a very nice property, especially for the edge cases. You also don't have to deal with the situation where one input is perfectly aligned and the output is not: you end up with this weird mismatch and have to handle different edge cases; it's really a mess. With vector processing you don't do that, so you don't actually have to worry about it.

[14:29.000 --> 14:45.000] So, with that, we have covered the generalities. Now, how is it looking on the ARM side? Then we will look at the RISC-V side. I thought about putting everything together, but then it becomes a huge mess, so it's going to be a bit repetitive instead, because, of course, there are a lot of similarities.

[14:45.000 --> 15:16.000] SVE came out about five years ago, a little more than five years ago; I think it was announced in late 2016, if I recall correctly. It was pretty much not aimed at multimedia. It was explicitly meant for other things: scientific applications, engineering modeling, that kind of stuff; HPC, in short. And so nobody used it. At least nobody in this room used it.

[15:16.000 --> 15:40.000] This was fixed with SVE2, which is sometimes called ARMv9 because it kind of comes with ARMv9, but it's really called SVE2. It fixed that issue, with the realization that this programming model is actually also interesting for multimedia and crypto, which were missing from SVE1.
[15:40.000 --> 16:11.000] And so what they did is they just looked at which mnemonics were missing and added those. It's pretty much the same mnemonics; you just add the predicate register. That is, of course, a little bit more complicated, but as I mentioned, you just use a WHILE instruction, which will provision your predicate, and you have to pick the element size so that it all adds up correctly. And then you have a new set of branch conditions: first element, last element, and so on and so forth.

[16:11.000 --> 16:26.000] So the remaining elements are determined by the predicate register, the predicate register sets the condition flags, and there are also instructions to count the number of processed elements and subtract it from your counter register.

[16:26.000 --> 16:49.000] And yeah, at this point: how do you detect this stuff? There's a preprocessor macro; otherwise, as usual on ARMv8, you have a bunch of privileged registers for the OS to look at, and then on Linux you have a bunch of flags in the auxiliary vector's hardware-capability bits. So it's all classic. On another OS, you're out of luck.

[16:49.000 --> 17:22.000] Availability: as we said, 2016, but that didn't really work for us. SVE2 was specified in 2019, but the real hardware only came early last year: the Cortex-X2 and the other cores for the DynamIQ DSU-110. So the Samsung Exynos 2200 and so on do have SVE2. Unfortunately, it's only 128-bit vectors, and it's pretty damn expensive, but if you want to, you can find the hardware.

[17:22.000 --> 17:35.000] So, RISC-V: it's a different model. (From the audience: Can I add? There's also the Alibaba one, the Yitian.) Yeah, maybe, it's possible, yes. It's only available in China, but it's available.

[17:35.000 --> 17:56.000] So, on RISC-V, the predication is a little bit different: there is a separation between the element count and the actual predicate. In practice, in multimedia (maybe not in dav1d), you usually don't use the predicate at all; instead you just count the elements. This is the instruction you always find at the beginning of the loop, which configures the vector unit.
[17:56.000 --> 18:15.000] In this case, what we say is that we have a certain number of input elements, and we want to get back the number of elements the CPU will deal with in this iteration. We then have to give the size of the element in bits, in this case, for instance, 16 bits.

[18:15.000 --> 18:26.000] Then the group size, which is kind of free unrolling: if you set it to 2 and you say you want to use vector register 8, it will automatically use vector 8 and vector 9 at the same time, for instance.

[18:26.000 --> 18:48.000] And tail mode, which we always set to agnostic because we don't really care about the tail, and mask mode, which we also always set to agnostic. There might be use cases where you need to do something else, which might be a little bit slower, but usually you don't. These control what happens to the elements that are masked off, or that sit at the end of the vector beyond the ones you asked for. Usually you don't care about them, so you just tell the CPU that you don't care.

[18:48.000 --> 19:09.000] One cool thing about RISC-V: the floating-point registers are separate from the vector registers, unlike on ARM, so you have more registers available if you have hybrid calculations between the scalar and the vector side. But do mind the floating-point calling convention when this happens, otherwise you will screw up your register state and confuse your CPU.

[19:09.000 --> 19:22.000] Another interesting thing about RISC-V: they have segmented loads and stores, which are similar to the structured loads and stores on ARM, but they can do up to 8 structures, whereas ARM only goes up to 4.

[19:22.000 --> 19:44.000] What is perhaps much more interesting is strided loads and stores, where you can say: I have this register X which contains a value, and that's going to be my stride. So, for instance, you can put the width of your video into one register and load all the pixels of a column in one instruction, without having to do weird shuffling and whatever.

[19:44.000 --> 20:00.000] Does that actually perform in practice? I think that's going to depend on the design, but normally it should hit the data cache, which should be okay. I'll come back to that.
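As a sketch of the count-based loop just described, and of a strided column load, here is a version with the RVV 1.0 C intrinsics from riscv_vector.h; it assumes a toolchain that uses the __riscv_-prefixed intrinsic names, and the function names are mine:

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + b[i] over 16-bit elements. vsetvl returns how many elements
 * this iteration handles; e16 selects 16-bit elements, m2 a group of two
 * registers (the "free unrolling" group multiplier mentioned above). */
void add_s16(int16_t *c, const int16_t *a, const int16_t *b, size_t n)
{
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e16m2(n);
        vint16m2_t va = __riscv_vle16_v_i16m2(a, vl);  /* unit-stride load */
        vint16m2_t vb = __riscv_vle16_v_i16m2(b, vl);
        __riscv_vse16_v_i16m2(c, __riscv_vadd_vv_i16m2(va, vb, vl), vl);
        a += vl; b += vl; c += vl; n -= vl;
    }
}

/* Strided load: one 8-bit pixel per image row, i.e. a column, with the
 * line width (in bytes) passed as the stride. */
vuint8m1_t load_column(const uint8_t *top, ptrdiff_t stride, size_t vl)
{
    return __riscv_vlse8_v_u8m1(top, stride, vl);
}
```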
[20:00.000 --> 20:16.000] On the downside, you don't have transpose or zip instructions, which can be annoying, so you have to replace them with strides. That's fine if you want to take every second element from one vector, and so on.

[20:16.000 --> 20:27.000] Feature detection: they have very, very detailed preprocessor feature flags. You can download the slides if you're interested.

[20:27.000 --> 20:53.000] On the other hand, runtime detection is pretty crappy. You have to trust the device tree node, so you have to trust the bootloader to describe it correctly to your OS in the device tree data structure. Otherwise, there is a flag: the V bit, bit 21, because V is the 22nd letter of the alphabet, in the auxiliary vector for hardware capabilities on Linux.

[20:53.000 --> 21:24.000] Availability: unfortunately, at this time there is no hardware. Alibaba, sorry, T-Head, has made hardware available, but it implements version 0.7.1, from about 18 months before the standardized specification, which is what Clang and GCC implement. So you can kind of work with that, and it gives you some idea of the performance, but you are going to have to rewrite stuff, because it's not completely bit-compatible, so it's kind of annoying.

[21:24.000 --> 21:35.000] I don't know when this stuff is going to happen. I'm pretty sure it is going to happen, and I would guess that by the end of this year we are going to see hardware available.

[21:35.000 --> 21:55.000] Also, kind of dodging the earlier question: because we have so many different vendors on RISC-V (I only listed three here, but I think there are others), there might be big differences between the performance characteristics of the different vendors. These are our references.

[22:01.000 --> 22:26.000] Yes, I have just a few questions. Have you heard of the SVP64 project from Libre-SOC, which is a kind of similar vector approach for PowerPC? No, I haven't looked at PowerPC at all. Another question, from my own SIMD programming work: we often have applications that are inherently horizontal.
[22:26.000 --> 22:46.000] For example, say you are writing a vectorized string search, or you are decoding JPEGs, where you have these 8x8 blocks on which you want to do some sort of cosine transform. They have a fixed size, and depending on the vector size, you want to break them up, or you may have to process multiple of them at the same time. Is there an intelligent way to solve this?

[22:46.000 --> 23:18.000] I've had this case. So the question is, paraphrasing: when you have a naturally fixed-size input block that you want to process at a time, how do you do this? Because then you actually want a fixed-size vector, in effect. I've had this case a couple of times. One way is to just check that the vector size of the CPU is big enough and do one block at a time. If you can, try to do several at a time: the vector size is always going to be a power of two, so you should be able to parallelize relatively easily. Obviously, the ideal situation is to parallelize.

[23:18.000 --> 23:32.000] Where you will have a problem is if your dataset is larger than the vector; then it's going to become complicated for you. On RISC-V you can deal with this with the group multiplier, which will allow you to use multiple vector registers as a single vector.

[23:32.000 --> 23:46.000] And the last question I have is: how do you realistically test vectorized routines when the hardware you have only supports one vector length at most? Do you have to use some sort of emulation to set this up?

[23:46.000 --> 24:04.000] So the question is how you test different vector sizes, for validation, I guess. Most of the loops don't really care about the vector size, because if you have a simple case that follows the simple pattern, it doesn't matter what the vector size is; except for benchmarking, of course, and there you have a problem.

[24:04.000 --> 24:28.000] Otherwise, QEMU and Spike, at least for RISC-V, support any vector size you give them, as long as it's a valid one from the specification's point of view. Do you realistically test for that, or do you just say it's simply not going to be a problem? I mean, personally, when I've had the situation where I had a fixed-size input, I did test with different vector sizes; in most cases you don't actually care.
[24:28.000 --> 24:36.000] I mean, then it's a matter of choice, of how you do your testing and how strict you want to be with the validation, I think. Thank you.

[24:36.000 --> 24:38.000] Do we have one more question?

[24:38.000 --> 25:31.000] First, a disclaimer: I'm involved with the Libre-SOC project and SVP64. It's similar to RISC-V vectors, not exactly the same, but they share a lot of the common ideas. You mentioned a very good point, that SIMD is not vector processing. Where I work, we had to try to port some code from NEON to SVE2, and it was less than suboptimal, let's say. We had to revert back to the original C algorithm.