Thank you everyone for coming. I'm going to do a quick talk on porting software to RISC-V, so thank you very much for attending. To quickly introduce myself: I'm a software engineer and team leader at Rivos, a hardware company building RISC-V CPUs. I lead the managed runtimes, system libraries and profiling team, so our scope is OpenJDK, Python, Go, system libraries like OpenBLAS and Sleef, math stuff, and everything profiling. I also wear a hat in the language runtimes working group at RISE.

So what is RISE? It's a collaborative effort to accelerate the development of open-source software for the RISC-V architecture. We are basically a consortium of companies who are interested in porting software to RISC-V, and we're investing a lot in that. I'll get back to it, but we're doing a lot, and we would love to have you involved as well. The focus of this working group is on OpenJDK, Go, Python, .NET, Android Runtime and V8. Most of them already support RISC-V to varying degrees, as we'll see later. The focus is on the compilers of these different runtimes, on the runtimes themselves, like the base class libraries, but also on the ecosystem: making sure that the most used Java libraries, the most used .NET libraries, the most used Python libraries and so on are well supported on RISC-V. My last hat is in the Adoptium working group, where we are distributing Java, making sure there is a Java distribution available for 11, 17, 21. It's in progress, we're getting close, but you should soon have a ready-made distribution of Java on RISC-V.

Let me just increase the size here. Who is the intended audience of this talk? It's really for people who have some experience with RISC-V and want to get more involved, but also for people who have very little or no experience with RISC-V but think it sounds exciting. And it is really exciting: there's a lot of work to do and it's a lot of fun. I will not talk assembly, so don't get scared, and if you don't know a concept or a word, please ask; I would love to have a bit of interaction. Also, the target is application processors: anything like smartphones, laptops, desktops, servers and HPC. We're not going to talk about embedded, we're not going to talk about microcontrollers; that's not the topic of this talk.

First of all, I want to give a huge shout-out to the Arm64 ecosystem. They made a lot of things a lot easier to port to RISC-V: there have been years of investment in porting a lot of software to Arm64, and that just makes the path easier for RISC-V. There are a lot of libraries out there, for example, which support x86, Arm64, PPC and s390x, and adding RISC-V support to those is very straightforward, because adding RISC-V is just one more flag, one more configuration somewhere. The ifdefs are already there, the compiler support is already there, the cross-compilation support, the CI setup and everything are already there; RISC-V is just one more thing. For libraries and projects which only support x86, though, the work is more involved. We have to teach the build system about cross-compilation, for example, and in the sources we have to teach it that not everything is x86: there are assumptions about x86, and you want to make sure you root out all of those issues.
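To give a flavor of that "one more configuration" kind of patch, here is a minimal sketch of adding RISC-V next to architectures a library already special-cases. The predefined macros are the standard compiler ones; the ARCH_NAME dispatch itself is a hypothetical example:

```c
/* Hypothetical per-architecture dispatch; only the last #elif is new.
 * __riscv and __riscv_xlen are the standard predefined macros. */
#if defined(__x86_64__)
#  define ARCH_NAME "x86-64"
#elif defined(__aarch64__)
#  define ARCH_NAME "arm64"
#elif defined(__powerpc64__)
#  define ARCH_NAME "ppc64"
#elif defined(__s390x__)
#  define ARCH_NAME "s390x"
#elif defined(__riscv) && __riscv_xlen == 64   /* the one new case */
#  define ARCH_NAME "riscv64"
#else
#  error "port me"
#endif
```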
In terms of resources, in the RISC-V ecosystem the reference is the RISC-V GitHub organization. It's spread out, but it's really the most complete. For anything related to scalar instructions you have the RISC-V ISA manual; for the vector instructions you have the vector spec; for the vector crypto instructions you have the vector crypto spec; and for the vector intrinsics you have the vector intrinsics spec. So there are a bunch of documents, a bit spread out, but they really are the reference. The GitHub organization is also where the work happens. There is work happening on mailing lists, there is work happening in meetings, but in the end all the specs, all the documents resulting from all the discussions, are on GitHub, so GitHub really is the reference. That's quite unique compared to, for example, x86 or Arm or other architectures: it's very open source, very easy to access. It's not always easy to find things, but it's easy to look for them.

Also, watch out for pre-releases. Things may change between a pre-release and the release, and it gets weird. For example, for the vector spec there was the 0.7 draft, which some hardware vendors adopted for some of their boards, but the encodings of some instructions, and even the instructions themselves, completely changed. That means if you are coding against vector 1.0, it's just not going to work on any board which implements vector 0.7. Not "sometimes it's going to fail": it's just not going to work.

There is also this very nice resource I found recently, the RISC-V Intrinsic Viewer, which lets you look through the vector intrinsics implemented in GCC and Clang. There are 14,000 intrinsics, so it's very nice to have a good way to browse them. A lot of that count is repetition: for example, there are about 5,000 just for vector loads and stores, simply because of the combinatorial explosion of what kind of load you want and what element sizes you want to load. There are just a lot of them, and the viewer makes it easy to look through them.

The first question you should ask yourself when you are targeting RISC-V is: what am I targeting? As you may know, RISC-V has a concept of extensions, so you want to understand: I'm writing software, but who is going to be able to run it? Which boards will be able to run it? The most basic target, for application processors, is RV64GC; even the very first board, the HiFive Unleashed, supports that. The second step is bit manip: Zba, Zbb, Zbs, which give you bit manipulation instructions, things like rotates, or looking at one bit of a word. I don't have all of them in mind, but a lot of boards implement it; I think from the HiFive Unmatched on, which was about the second board on the market, it's supported, so it's very, very common. Then you have vector. Again, please target vector 1.0; please don't target 0.7. Today there is one board that supports vector 1.0; it has one core, it's not very fast, but at least it supports it. Nearly nothing else supports it yet. And vector crypto: no hardware supports it yet; it's probably going to come one or two years down the line, but we really expect that to be the future.
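As a sketch of how those extension tiers show up in source code: recent GCC and Clang define a test macro per extension enabled in -march (per the RISC-V C API), so you can select a code path at compile time. A minimal illustration, with a hypothetical function name:

```c
/* Which path did we compile in? The __riscv_* test macros are set by the
 * compiler from -march, e.g. -march=rv64gc_zba_zbb_zbs or -march=rv64gcv. */
static const char *compiled_path(void) {
#if defined(__riscv_vector)
    return "vector";     /* RVV available at compile time */
#elif defined(__riscv_zbb)
    return "bitmanip";   /* Zbb (and friends) available */
#else
    return "rv64gc";     /* baseline */
#endif
}
```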
Obviously, vector crypto was only ratified two months ago or something, so nothing implements it yet. To simplify all of this, there is the concept of profiles, which is a RISC-V International concept, with RVA20, RVA22, RVA23; RVA24 is under discussion, there's going to be an RVA25, et cetera. You can find the spec at this link as well. The idea is to define a set of extensions so that software can target a specific profile, and hardware can say: we are RVA22 compatible, we are RVA23 compatible. That just makes things easier than having a hundred different extensions where you don't really know what to target.

What you can target today with very good certainty is RV64GC plus bit manip, and then hardware-probe for vector and vector crypto; I'm going to talk about how to probe right after. Using my crystal ball, I can say that the expectation for the future is RVA23 plus vector crypto. Please don't quote me on that, it's just a crystal ball; let's see how the future evolves, but that's the expectation. Knowing that in RVA23 vector is mandatory, the future would basically be RV64GC plus bit manip plus vector plus vector crypto. I don't know when that's going to happen either, but that's the expectation.

In all of that, hwprobe is your friend. So what is hwprobe? It's a Linux kernel syscall that allows you to check for hardware capabilities. For example: does my machine support Zba? It's going to tell you. Does my machine support V? It's going to tell you. Does my machine support Zvknha, which is the SHA-256 vector crypto extension? It's going to tell you. It's evolving very rapidly: in the latest Linux release, 6.6 or 6.8, I don't remember, they added like 15 or 20 extensions which can be probed through the syscall. So it's always better to be on the latest Linux version to use it. (There's a small sketch of calling it just below.)

Once you know what you want to target, you want to look at your compilers, runtimes and libraries: what is supported on RISC-V? There is already support for RISC-V in many compilers and runtimes: GCC, LLVM, OpenJDK, Go, Python, .NET, V8 and many more all support RISC-V, with varying degrees of quality. Ahead of the curve, I'd say GCC, LLVM and OpenJDK support RISC-V very well. At the back of the wave, and I'm not going to name them, some will only support RV64GC, which is functional but not very fast; at least it's functional, you can actually test with it. It's also evolving very rapidly. For example, with the team we contribute regularly to OpenJDK, and in every release there are dozens of commits improving things: hey, now we support this extension; hey, now we actually emit these instructions. It's going very fast, so it's important to be on the latest and greatest. Same for the kernel, which I talked about before: vector support landed like six months ago or something, so before that you didn't have a Linux kernel release with support for vector. It's very important to be on the latest ones.
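Here is that hwprobe sketch: a minimal, hedged example of the riscv_hwprobe syscall as exposed by recent kernels (the syscall landed in 6.4; the Zba and V bits shown need reasonably new headers). Take it as the shape of the API rather than production code:

```c
/* Build on a riscv64 Linux system with recent kernel headers. */
#include <asm/hwprobe.h>     /* struct riscv_hwprobe, RISCV_HWPROBE_* */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };
    /* args: pairs, pair_count, cpusetsize, cpus (NULL = all), flags */
    if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
        return 1;
    printf("Zba: %s\n", (pair.value & RISCV_HWPROBE_EXT_ZBA) ? "yes" : "no");
    printf("V:   %s\n", (pair.value & RISCV_HWPROBE_IMA_V) ? "yes" : "no");
    return 0;
}
```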
Also, more and more libraries support RISC-V, and that's where I expect most of the upcoming work, because it's great that GCC and LLVM support RISC-V, but if any one of your 20 dependencies doesn't support RISC-V, it's not going to work, right? You need all of your libraries to support RISC-V. RVI maintains this very nice page, which is not very complete but is a good reference, highlighting some of the projects that support RISC-V. If your project supports RISC-V and doesn't sit on this page, please report it to RVI; I think they will be very happy to add it. Also, in all of that, a huge shout-out to all the contributors. I don't know you, but thank you very much for all the work you're doing; without you, none of this would be possible. And many of you are doing it in your free time, so thank you very much.

So what are some of the difficulties and gotchas that we ran into? The main one is baked-in assumptions. A lot of code has been written, over a long time, for x86. For example, who assumes that a page is 4K? Luckily, RISC-V pages are 4K, but some architectures in the past asked: hey, what if a page is 16K? Who knows what's going to happen, why things break?

Something specific to RISC-V: vector length agnostic code. Some architectures, for example x86 and Arm Neon, say a vector is going to be 128 bits or 256 bits, and the software engineers writing code know that's the vector length and write their code for that. RISC-V decided to be smart and decided that everything should be vector length agnostic. Well, that makes it a bit harder to port software, because a lot of software assumes that vectors are 128, 256 or 512 bits, and just doesn't know what it means to be vector length agnostic.

Also, canonical NaNs. On x86, basically any bit pattern of NaN is valid, and if you multiply 1 by a NaN, it returns you that same NaN; for example, it keeps the sign. On RISC-V, though, if you do 1.0 times minus-NaN, it returns you the canonical NaN. That's part of the spec; that's how hardware should behave. That means that if you multiply 1 by a "negative" NaN, it returns you a "positive" one. Right? It's not-a-number, so what should the behavior be? Well, RISC-V says it should be the positive, canonical one. So if you check the sign of that result in C++, it returns you something different on x86 and on RISC-V. Those are gotchas we have to look out for. (There's a small sketch of this at the end of this part.)

Something else: the memory model. It's strong on x86, weak on RISC-V. I think Arm64 did a great job there as well of teaching people that weak memory models are different and worth looking out for, but it's still something we have to watch.

Also, the vector spec's simplicity: it's very simple to program against, but hard to implement in hardware. You have single instructions that can do something very complex, and implementing them efficiently in hardware can be very difficult or even impossible. Meaning: sometimes you're streaming instructions, everything takes 5 cycles, 15 cycles, maybe 30 cycles, and then suddenly one instruction cannot be executed by the hardware, has to be emulated, and takes a thousand cycles. And you don't see it while you're programming in user space; it just works. But then you go: oh, I vectorized my code and it's 50 times slower, what happened? I don't understand, it should be eight or sixteen times faster, and it's slower. Well, that's going to be the problem. Solutions here: test it, figure out what's happening, use profiling tools, figure out what can go wrong, and eventually refer to vendor-specific information, like an optimization manual that may tell you this instruction is slow on this hardware because of X, Y, Z.
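Here is that NaN sketch, a minimal C illustration of the canonicalization gotcha. The exact x86 outcome depends on compiler and instruction selection, so take the comments as the typical behavior:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile double one = 1.0;             /* keep the multiply at run time */
    double neg_nan = copysign(NAN, -1.0);  /* a NaN with the sign bit set */
    double r = one * neg_nan;
    /* x86 typically propagates the input NaN, sign bit included; the
     * RISC-V FPU returns the canonical NaN, whose sign bit is clear. */
    printf("signbit(1.0 * -NaN) = %d\n", signbit(r) ? 1 : 0);
    return 0;
}
```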
So great, now you are developing, compiling and testing. How do you test things, basically? Here, QEMU is your friend, and other emulators too. Functionally, QEMU is, I think, the most complete, purely because when people want to make a new extension, they implement it on QEMU to test what it would mean; they prototype it on QEMU. And then whenever it's ratified, well, it's already there, right? Everything was prototyped, so it's just there. Vector crypto, for example: QEMU was the first to support it, because it was prototyped there, so it was easy.

Yes, please. [You just talked about how you need to measure, for example, which instructions are slow. How well does that translate to QEMU?] I'm going to talk about it after. So the question was: how do we measure performance on QEMU? That's the next slide. Oh, sorry, the second-next slide.

User-space emulation is also easy enough on Ubuntu or Debian based systems: you do apt-get install qemu-user-static, and then you can just run a RISC-V Docker image. Like, you literally run RISC-V; it's Ubuntu on RISC-V and it just works. So it's very easy to test, and it's great for most testing. There are some leaky abstractions. For example, on QEMU prior to 7.0, /proc/cpuinfo would read the host's one. So you're in a RISC-V environment, running a RISC-V binary, you read /proc/cpuinfo and it tells you it supports AVX2. What? That's not the right thing; what's happening? [Audience correction.] No? OK, I was wrong: post 7.0, that specific thing is emulated, so you don't have this problem anymore.

But QEMU is also not particularly fast. It can be like five to ten times slower on single-core performance, so if you want to test, have a machine with 64 cores and things go faster. Also, linking large executables or libraries can take a long time, because linking is usually single-core, so it can be slow. And debugging can get pretty complicated, because you attach GDB, right, but do you attach it to the x86 QEMU process, or to the RISC-V process that QEMU is emulating? [QEMU can act as a GDB server.] Yes, sorry, QEMU can act as a GDB server. It works most of the time, but it's a bit broken in some ways sometimes, and that's where it can get a bit complicated.

That brings me to the next point: the other way of doing it is cross-compilation and testing on dev boards. Obviously you have faster build times, because your compilers are native, so it's just faster. The downsides: well, first of all, does your project support cross-compilation? That's not a given. Also, today's boards have limitations; they don't support everything. As I mentioned before: vector, only one board supports it; vector crypto, nothing supports it, so you cannot test any of those algorithms, you're stuck with QEMU for that. Also, hardware bugs: not every board behaves perfectly. I'm not going to name names, but, for example, some boards have atomics bugs: you do an atomic operation, and it reports failure when it actually succeeded. So mutexes can be a bit complicated to implement on top of that. On that, retries, again, are your friend. If it fails once, try a second time; if it then succeeds, and it always succeeds on another board that you know doesn't have the bug, it's probably just the hardware bug.
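As an aside, a minimal sketch of that retry idiom in C11: compare_exchange_weak is allowed to fail spuriously by the standard anyway, so portable code already wants a loop like this, and the same loop rides out that kind of flaky-atomics hardware bug:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically increment *counter if it is below limit, retrying on failure.
 * On a failed CAS, `cur` is reloaded with the current value, so the loop
 * re-evaluates the condition and tries again. */
static bool increment_if_below(atomic_int *counter, int limit) {
    int cur = atomic_load_explicit(counter, memory_order_relaxed);
    while (cur < limit) {
        if (atomic_compare_exchange_weak_explicit(
                counter, &cur, cur + 1,
                memory_order_acq_rel, memory_order_relaxed))
            return true;
    }
    return false;
}
```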
So for CI, QEMU is your friend again. On GitHub Actions, for example, which is quite nice because you get a lot of free CI time, it's a one-liner: you use the Docker setup-QEMU action, and suddenly you have QEMU set up on your machine. It takes like three to five seconds to set up, so it's also very fast. You don't even need Docker: you can just use QEMU user mode with an LD prefix and a sysroot, to have a RISC-V file system, /etc, /usr, /home, everything, and QEMU will take care of loading files from there rather than from the host x86 one. And then it just works: you have a RISC-V machine on GitHub Actions for free. Slow, but functional.

Also, you can tweak the available CPU options. For example, you can say: I want Zba, Zbb, Zbs, but I don't want vector. Say I'm working on my fancy library where I added vector support, and I want to make sure it still runs, without crashing, on boards that don't support vector: I can do that. Or I want to run with a vector length of 128 bits: great, you just specify vlen=128 and it just works. 256 bits, same thing; you can go up to 16,000 if you want, as long as it's a power of two. Basically, it makes it very easy to test a lot of different configurations for free in CI. (And if your code is written vector length agnostic, the same binary exercises all those vector lengths; see the sketch at the end of this part.)

So that's where I come to the question about performance. QEMU is not for performance measurement. It is not cycle-accurate; it does not even try to be cycle-accurate; that's just not what it's made for. For cycle-accurate measurement, there are some open-source simulators, like gem5, but the vendor-specific ones are extremely secret, obviously, because they contain a lot of information about the microarchitecture. So don't ask me; I will not share any information about our cycle-accurate simulators. What you can do, eventually, is instruction counts. The problem is that it's very inaccurate: if you go from 10 instructions to 5, but each instruction takes 10 times the latency, yes, you're going to use fewer instructions, but it's going to take longer. So that's not perfect; better than nothing, but not perfect.

The second option is boards. The problem is: imagine optimizing for high-end servers on a Raspberry Pi; the performance profile doesn't really match. The boards we have today are all in-order CPUs, there are few cores, and scalability is not very great. For example, if you go to 64 cores on a board whose design was meant for four cores, memory accesses are going to be slow as soon as you load it a bit, so it can be limiting. Also, again, only one board supports vector, the Kendryte K230, so you cannot even really test vector, and you cannot really test vector multi-core either. It's getting better, slowly, but it's getting better.

Yes, please. [I think the vector version on that board is 0.7?] On the Kendryte K230 it's vector 1.0; it's the first one that supports vector 1.0. It's, for example, the Lichee Pi, with the C910, that only supports vector 0.7.
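Here is that vector length agnostic sketch: a saxpy loop in C using the RVV 1.0 intrinsics (supported by recent GCC and Clang, built with something like -march=rv64gcv). The same binary runs unchanged whether QEMU is configured with vlen=128, 256 or more:

```c
#include <riscv_vector.h>
#include <stddef.h>

/* y[i] += a * x[i], vector length agnostic: vsetvl asks the hardware how
 * many 32-bit elements fit per iteration, whatever VLEN happens to be. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
        vl = __riscv_vsetvl_e32m8(n);                   /* elements this pass */
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl); /* load x */
        vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl); /* load y */
        vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);    /* y += a * x */
        __riscv_vse32_v_f32m8(y, vy, vl);               /* store y */
    }
}
```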
And about the optimization manual: that's something we're working on at RISE, and it should be coming in the next few days, so stay tuned. It's a guide on how to write performant code, with the input of companies who are actually making chips, and who know "this kind of thing is going to work well on our chip". We came together and said: these are common guidelines for writing efficient RISC-V code.

So, to sum up: it's fun, and it's never too late. Please get involved; there is so much more work to do, more work than you can imagine. We have enough work for the whole industry for the next five to ten years at least, so please get involved. Check out wiki.riseproject.dev if you don't really know where to start: we're trying to outline some of the work we're planning to do, and we welcome contributions. Also, if you have an idea, make a proposal; we pay money for that. As in, if you think there is some project that should be ported to RISC-V, we're ready to sponsor it, so it's paid open-source work, which is not that common. And finally, thank you to all contributors again; without everyone, it would just not be possible.

(Applause)

Yes, please. [What about running a softcore for testing?] Softcore: can you define softcore, please? [Basically a Verilog implementation of the core: compiling it for an FPGA and running it on the FPGA. You test with QEMU, which is fast, but, you know, not that fast.] OK, so the question is about literally running a softcore on an FPGA: basically taking a Verilog implementation of a core and running on that. We are doing that internally, for cycle accuracy. Even that is way slower than QEMU. I think in order of speed today: boards are still a bit faster than QEMU; then QEMU, which, as long as you have a lot of cores, like 32 to 64 cores, is ideal; then FPGAs and emulators and that kind of thing; and then software emulation of the RTL. If you reverse that order, in terms of accuracy of the performance results, software emulation, emulators and FPGAs are going to be the most accurate, because they are actually trying to be cycle-accurate; QEMU is not trying to be cycle-accurate; and the boards are, by definition, cycle-accurate. But the big advantage of emulators and simulators and all of that is that you can literally get a trace: this instruction took that many cycles, that instruction took that many, a stream of every instruction executed. So you can have a very precise and accurate representation of how your application ran. Obviously, if you run something that would take 10 seconds on x86, you generate like 300 gigs of data, but you have very precise and accurate information.

Yes, please. [My question is more for the hardware vendors. If somebody starts a tape-out today, what should they target? RTL development plus the tape-out itself takes a lot of time, so you're looking at one or two years. I can have some iterations in between, but the problem is: if something changes in, say, the vector extensions, my entire timing closure gets disturbed. Should I only start with what you suggest now, or can I have a plan to update in six or eight months if something new comes? What is your suggestion?] So the question is: what should be the target?
If I want to make new hardware today, what should the target profile be, for example? And do I understand correctly that you're also alluding to the cycle timing of instructions and things like this? [Yes. The number of cycles for the same thing will be different if a better vector crypto or something comes along.] Yeah. If we look at what's being done by the companies who talk publicly about what's happening: there is SiFive, there is T-Head, there is Ventana; let's take these three as examples. T-Head's latest announcement was the C908, which does support vector 1.0, but they announced it only a few weeks or a month after vector crypto was announced. Obviously, when they announced it, they knew it was going to tape out, and vector crypto is a complex spec, so they didn't have time to implement it. Ventana has announced that their second-gen chip would support vector 1.0; I think that was announced at the RISC-V Summit last November. But I don't think they're targeting vector crypto, because, same thing: when it was announced, vector crypto had just come out or was about to come out. SiFive, same thing: they announced a vector 1.0 chip, but again no vector crypto. The expectation is that vector crypto chips are going to be announced like a year from now, maybe a year and a half, maybe two years. We all hope it would be two months from now, but things shift. If you start a chip today, I sure hope you are targeting RVA23 plus vector crypto. But that's for starting a chip today, and that's easy for me to say. Knowing that a chip takes... if you look at Intel's timings, a chip takes five to seven years to tape out. You start today, you deliver in 2030. Does that answer your question?

[Yeah. So what I get is that if I start today, I can't just skip vector crypto.] If you tape out right after today, I think you can expect to have boards having it in 2025, 2026. And that's where hwprobe is good for you: you can check whether an extension is supported by the hardware, and if it is, use this path, else use the other path. That's how it's done in OpenJDK, that's how it's done in OpenSSL, that's how it's done everywhere. (There's a small sketch of the pattern just below.) Because it's funny: we say hardware has a five-year lead time, but software has one too. It's not just "I have my commit and then everyone has it", right? You need a release, and a release is not enough, because it needs to ship in a distribution, and then hopefully it's an LTS, so you easily have a two to three year lead time in software, often. So it's important that if you are doing crypto stuff, like OpenSSL, you do vector crypto today, because by the time it gets into the hands of, like, Red Hat 10 customers, or Ubuntu 24.04, or anything like this, well, you want to have committed things a year ago, two years ago.

[So these hardware vendors are talking to RISE to get this input, because you are kind of connecting all the open source?] SiFive is part of RISE, Ventana is part of RISE, Rivos is part of RISE, Alibaba is part of RISE. We really want to push the software forward, because we also understand that software needs to be ready: when we get the boards out, we need people to actually have software to run on them.
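Here is that sketch: a minimal "if supported, use this path, else the other" dispatch. have_zvkn() stands in for an hwprobe query like the one sketched earlier, and the two sha256_* implementations are hypothetical names:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool have_zvkn(void);  /* e.g. an hwprobe wrapper, as sketched earlier */
void sha256_vector(uint8_t *out, const uint8_t *in, size_t n);
void sha256_scalar(uint8_t *out, const uint8_t *in, size_t n);

static void (*sha256_impl)(uint8_t *, const uint8_t *, size_t);

/* Probe once on first use, then call through the cached pointer.
 * (Lazy init kept simple here; real code would make it thread-safe.) */
void sha256(uint8_t *out, const uint8_t *in, size_t n) {
    if (!sha256_impl)
        sha256_impl = have_zvkn() ? sha256_vector : sha256_scalar;
    sha256_impl(out, in, n);
}
```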
So here I talked about a lot of user-space stuff, but RISE also has an interest in kernels, in debuggers, in firmware, in OpenSBI, in emulators, in, like, everything. We want all the software, and even a bit of the firmware, to work, right? We need everything to work, basically. Thank you.

Yes. [Comment from the audience, roughly: I think you're also being asked about knowing which instructions to use to target performance improvements in software. The compilers provided by the hardware vendors already know the appropriate patterns for their hardware, so first let the compiler do it; only then go down to micro-optimization. And general advice is not going to be perfect for every microarchitecture, similar to Arm64. If you have a RISC-V implementation which is semi-optimal, contributing it to an open-source project is still very valuable, even if it isn't specific to your microarchitecture.]

So, just to repeat what Kieran said for the people online: if you're trying to do some optimization, the compiler provided by a vendor will usually have vendor-specific information, usually things that have not landed upstream yet, or that have landed upstream only in the latest trunk, so not released yet; they just want to make sure their customers, whoever buys Ventana or Rivos or SiFive stuff, have access to the best compiler. And then, if you're trying to do really hand-written assembly optimization, that's where you start referring to the optimization manuals. There is the RISE one, but you should also expect Rivos to have one, Ventana to have one, SiFive to have one, and those will be very specific to the hardware, while the RISE RISC-V optimization manual will be very generic: for example, how to best use vector instructions, what to avoid, how to best use them across most hardware.

I think OpenBLAS is very interesting in that respect, because they are very specialized on x86, right? They have 20 or so different kernels for x86, based on the generation of Intel and the generation of AMD, because they know the timing of instructions is different. So you can compile for Nehalem, or for Skylake, or for Zen 3 or Zen 4, and the kernels are going to be different. Yes, it's a linear algebra library, but I think it really pushes things very, very far, and that's the kind of thing that is going to be very vendor-specific, where you will have to refer to the vendor's optimization manual. Thank you for the question.

Yes, please. [Two questions. One: when you start porting open-source software to RISC-V, you probably have to talk to the developers or the maintainers as well. How often do you get the reaction: we have no idea what RISC-V is?]
So the question is: when you're trying to port a project to RISC-V, how often does it happen that the maintainer says "RISC-V? I don't know what that is"? Honestly, pretty rarely. I think Hacker News has done a great job of "RISC-V is exciting, RISC-V is new, get involved in RISC-V". People don't really know; they go "oh yeah, it's a new architecture, sounds exciting". They don't really know what it is, but at least they have heard about it.

[About three years ago, I was in a talk about BLIS, a variant of OpenBLAS, and I suggested that we should look at RISC-V, and they had no idea what it was. And these are pretty low-level people, right? That was three years ago, but I think it has changed.] Yes, so your comment: for BLIS, three years ago, people didn't know what RISC-V was, but hopefully today they know, and it works on RISC-V today. Actually, a fun tidbit for OpenBLAS: Serge, in the room, just got the riscv branch of OpenBLAS merged into the trunk branch. There was some generic support for RISC-V in the develop branch, but there was some vendor-specific optimization in the riscv branch, and we just merged it. So even for things as important as OpenBLAS, things are evolving fast. That's what I mean by: if you don't know how to get involved, know that there's a lot of stuff to get involved in. Look around at any library you generally use; probably one of them doesn't support RISC-V, so please add RISC-V support to it. It's going to be nice.

[And the other question: there are different aspects. There's getting it to build, to compile, and you talked about performance a bit as well, but what about running the test suite? There may be 100,000 tests and five of them failing on RISC-V, and fixing those may require a lot of work to get it to actually fully pass.] Yes. So the comment was: we can compile, we can test, but what about the five last tests which fail specifically on RISC-V? Either it's assumptions being made which don't hold on RISC-V, and that's usually the hard one to figure out, or the code is genuinely broken but just happened to work because it was running on x86. That's why I think having support for Arm64, s390x and PPC helps, because then the project already knows that the world is not just x86, and it's developed in a way that works everywhere. QEMU is very interesting there, because if you run QEMU on x86, it doesn't try to emulate the memory model: you get the host memory model. So you test on x86 under QEMU and, thanks to the host memory model, it just works; then you go to a board and suddenly it fails, and you don't know why, because it works under QEMU. Well, the thing is, it's TSO under QEMU on x86 and weak memory ordering on the RISC-V board, and that's the hard bug to track.

Yes, please. [I just wanted to add that Arm64 has the exact same issue.] So the comment was that Arm64 has exactly the same problem, and yes, it's exactly the same issue.

Yes, please. [While porting some software to RISC-V, I had a problem that didn't appear even on Arm64, so there are even bits where Arm64 doesn't help. It was the GP register, which isn't on...] Possibly.
[It's a long story, but it was related to the GP register, which works differently; it doesn't exist on Arm64 and does on RISC-V. So there are even stories like this.] Just to repeat your comment for the people online: it's not because everything works on Arm64 that everything is going to magically work on RISC-V. Yes, there are some RISC-V specificities.

Yes, please. [I'm just curious, and it may be outside the scope of the effort, but it seems like there's an opportunity to provide feedback to the hardware vendors from the code that's already ported: the prevalence of specific instructions in applications, so that the hardware vendors know what to focus on in terms of performance.] So the remark was: there is an opportunity, by looking at dynamic instruction traces, for hardware vendors to figure out what to invest in. And yes, there's a lot of opportunity there. The question for that is usually more: which workloads should we focus on? That's a very tough question, because obviously it depends which market you're targeting, and even within a given market there are five different pieces of software, five different frameworks. Which one are you looking at? How do you use the frameworks? Which libraries? But yes.

[Further on the general subject of non-obvious problems and using QEMU to analyze them: the single most useful non-obvious thing in QEMU is a mode for vector operations where it will trash the inactive parts of the registers, which helps diagnostics.] So the remark was: QEMU has a mode where you can trash... Trash or cache? [Trash.] OK. You can trash parts of the vector registers, if I understand correctly: for masked-off elements and for the tail elements, you can basically put garbage into them, and that kind of thing, to help you test, so that you will actually see when you are using something you did not expect to use in the vector. That just makes your life easier for testing, because it's going to crash instead of failing silently.

Yes, please. [For performance and compatibility, what about running QEMU on Arm64?] So the fun thing is, for Adoptium, which I mentioned at the beginning: to build Java, we are actually using an Arm64 VM, just because it's cheaper in the cloud. And yeah, it's compilation, but you basically run the GCC compiler under QEMU and it just works; no question, you are stress-testing GCC, but it works. For general testing, yes; but, for example, on GitHub Actions you only have x86 machines. I think it would be better to test on Arm64, because you have weak memory ordering, you have all of that, but it's harder to access.

Yes, please. [Arm64 SVE, the Scalable Vector Extension, being vector length agnostic as well, will help RISC-V by letting the world know that there are not just fixed-size vectors.] Yes, absolutely, like I mentioned a few slides back. For example, there is a project called xsimd, which is kind of an abstraction over vector stuff, and it assumes that vectors are 128, 256 or 512 bits; you don't have a choice. So fitting RISC-V into that is a bit painful.

OK, any other questions? OK, well, thank you everyone. Thank you.