Okay, so yeah, I'm the guy from over there who asked the intrinsics question. My name is Flo, nice to meet you. Last year, my colleague Adam and I gave a talk here at FOSDEM about the open-source VVC decoder VVdeC and the encoder VVenC, and I optimized the decoder for ARM architectures in my master's thesis, which is what I will talk about now. So basically that's this on the right: the SIMD optimization of VVdeC. VVdeC is optimized for SSE 4.1 and AVX2; no SSE 4.2 was needed, so 4.1 was enough. And to also be able to run VVdeC on ARM architectures, the open-source project SIMD Everywhere (SIMDe) is used, which in this case ports the SSE implementation to ARM, either by using built-in functions or NEON intrinsics, since it's ARM, or by falling back to scalar implementations and telling the compiler to vectorize them automatically. So in practice it's a combination of these; see the sketch at the end of this section.

So my goal was to make it faster for ARM. For that, the first thing I did was identify the hotspots. I profiled VVdeC using Instruments, since I was working on this M1 machine here. I divided the profiling into three steps. First of all, I identified the most time-consuming functions. For these, I then compared the performance on ARM against the performance on x86. And the third part was, since VVdeC implements every SIMD function as a non-vectorized version as well, I wanted to know how much speedup the SIMD implementation generates. With all this information, I chose the four most promising functions, which basically means I wanted to get the biggest bang for the buck.

So yeah, I chose to optimize these four functions. On the left you can see the names of these four functions. Don't mind the names; the only thing that is interesting is the speedup. This graphic shows the manual optimization, so the optimization I did, versus the automated optimization from SIMD Everywhere. I visualized this for one of the JVET video sequences at a quantization parameter of 43. And you can definitely see that two functions show a really nice acceleration compared to the SIMD Everywhere implementation, in this case the applyLut SIMD function and the xGetSAD function. But generally speaking, you can definitely notice that SIMD Everywhere does a decent job in comparison to just optimizing by hand with NEON intrinsics.

After having a look at the accelerations of the single functions, I also wanted to know how big the impact of optimizing these four functions is on the total acceleration of VVdeC. So I measured 11 JVET video sequences, each two times, obviously, since I needed to compare them, and averaged the results for the common quantization parameters. The range is between 3% and 9%. What is definitely noticeable is that the speedup gets lower with decreasing quantization parameter. This is because the bit rate is higher at lower quantization parameters, and because of that the entropy decoding gets more complex and takes a bigger piece of the cake. So yeah, that was basically my master's thesis in a nutshell.
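To make the SIMDe mechanism concrete, here is a minimal sketch of how existing SSE code can be compiled unchanged on ARM through SIMDe. The kernel add_s16 and its signature are made up for illustration; only the simde/x86/sse4.1.h header and the SIMDE_ENABLE_NATIVE_ALIASES macro are SIMDe's actual entry points:

```c
/* Hypothetical example: an SSE-style kernel compiled unchanged on ARM.
 * SIMDE_ENABLE_NATIVE_ALIASES lets the original _mm_* names resolve to
 * SIMDe's portable implementations (NEON-backed on ARM, scalar fallback
 * elsewhere). The macro must be defined before including the header. */
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/sse4.1.h>
#include <stdint.h>

/* Adds two arrays of 16-bit samples, 8 lanes at a time.
 * Assumes len is a multiple of 8. */
static void add_s16(int16_t *dst, const int16_t *a, const int16_t *b, int len)
{
    for (int i = 0; i < len; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
    }
}
```

With the native aliases enabled, code like this compiles on AArch64 without touching the kernel itself, which is essentially what VVdeC gets from SIMDe before any hand tuning.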
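And for comparison, a hand-written NEON kernel in the spirit of xGetSAD (a sum of absolute differences) could look roughly like the sketch below. It assumes 8-bit samples and a length divisible by 16; the actual VVdeC code operates on 16-bit Pel samples, so this is not the real implementation:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical NEON SAD kernel: sums |src[i] - ref[i]| over len bytes.
 * Assumes len is a multiple of 16; not the actual VVdeC implementation. */
static uint32_t sad_u8_neon(const uint8_t *src, const uint8_t *ref, int len)
{
    uint32x4_t acc = vdupq_n_u32(0);
    for (int i = 0; i < len; i += 16) {
        uint8x16_t a = vld1q_u8(src + i);       /* 16 source samples     */
        uint8x16_t b = vld1q_u8(ref + i);       /* 16 reference samples  */
        uint8x16_t d = vabdq_u8(a, b);          /* per-byte |a - b|      */
        acc = vpadalq_u16(acc, vpaddlq_u8(d));  /* widen pairs, accumulate */
    }
    return vaddvq_u32(acc);                     /* horizontal sum (AArch64) */
}
```

One reason hand-written code like this can beat the automatic port is that it can pick NEON-native instruction sequences, such as vabd plus widening pairwise adds, instead of emulating the semantics of the original SSE intrinsics.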
And after that, I also integrated SIMD Everywhere to port the AVX2 implementation to ARM, which also led to a contribution to SIMD Everywhere, which was pretty nice, since there were some errors in the port. And right now, since there is also an encoder, I'm repeating the optimization for VVenC. In the future, we might also optimize directly for the Scalable Vector Extension, or the Scalable Matrix Extension. So yeah, thank you for joining us. If you have any questions, feel free to ask; you can also ask me at the FOSDEM drink-up afterwards.

"I have one. Do the decoding improvements translate across all the speed presets when you do the encoding?"

Sorry, what about the presets?

"When you do the encoding, you have different presets, right? After the encoding, when you decode, does it translate across all the presets? Because every preset may not have all the tools."

Yeah, that's true. So the question was that there are different presets in the encoding, which affect the functions called in the decoding. This is true. I tried to get a general overview of which functions were used by profiling several settings, and tried to figure out which functions were used most, and averaged that, basically. So there's a bigger story behind the profiling, obviously, since this was only a five-minute talk.

"Does this mean that I can use a Raspberry Pi now to decode it? Have you tried other ARM devices?"

So the question is, can I use a Raspberry Pi to decode it? And the Raspberry Pi is based on an ARM core, right? So I would say yes, obviously you can, and you could actually do it before as well, because SIMD Everywhere was already included, and SIMD Everywhere ports the SSE implementation to ARM. It doesn't do it perfectly, obviously, but some colleagues of mine submitted a paper to the Mile-High Video conference where they measure the performance of SIMD Everywhere on ARM; I can probably even put that up on the FOSDEM site if you want to see it. The platforms which are supported are also visible on the GitHub repository. This is also on the FOSDEM website: when you go to my talk, the VVdeC repository is linked there, and you can see it there.

Alright, there's another question. "Why did you optimize by hand instead of using [inaudible]?"

I mean, obviously we are still the best, right? We are still the best when it comes to decoding and encoding. But VVdeC and VVenC are performing pretty well in comparison to other VVC codecs, I would say, right? Yeah, that's true, that's obviously true. I mean, of course we have a head start, but let's see, right? There's nothing better than healthy competition. Yeah.