All right, we're ready for the next talk. Mark is going to explain to us how to overcome MPI ABI incompatibility with Wi4MPI.

Yes, can you hear me? Great. So, hi, my name is Mark, I work at CEA, and today I'll talk about how to overcome MPI ABI incompatibility using Wi4MPI. Just a couple of words about CEA: we are a French organization, we host big supercomputers that are used at the national and European level, and in a couple of years we will host one of the two exascale systems in Europe.

I'll start with a quick introduction to the MPI ABI incompatibility issue and why we actually care about this problem. I'll try to give you an insight into what the problem really is, then I'll show how we can overcome it by dynamically translating between different implementations, and I'll show you some use cases before concluding.

First of all, why do we need MPI library portability? There are a few reasons that motivate us. The first one is to be able to work around the limitations of an MPI library. As you may know, MPI is a norm and there are different implementations, and like all software libraries they may have limitations, they may have bugs. So being able to switch between libraries is interesting for diagnosing the source of a problem. It's interesting at the level of the user, at the level of the people managing the supercomputers, and of course at the level of the developers. It can also help you choose the best MPI implementation, because, as you may also know, different implementations won't necessarily use the same algorithm for a given communication, and on a specific cluster some MPI implementations may be better optimized than others. It's interesting to be able to test all of this easily.

It's also useful for enabling fast and portable containers. What containers offer, or claim to offer, is flexibility and portability, and that's almost true, but there is a problem in HPC: you lose portability if you need to match the host MPI. And in almost all cases you will need to match the host MPI, because it is vendor-optimized and you have no other choice. It's also not very likely that your container ships the right underlying libraries; sometimes they are even closed.

It can also add flexibility to high-level languages. Some high-level languages, like Julia or Python, at some point depended on a specific MPI library, so if you wanted to use another one, you were stuck. You also sometimes build very complex software stacks with those languages, and if switching the MPI library means rebuilding everything, that is really, really time-consuming. Being able to switch easily is interesting there too.

The last point is to be able to run on bleeding-edge or early-access systems, because in many cases, on state-of-the-art systems, you will have a single vendor-optimized library. So if your application is already compiled with another implementation, you are stuck once again. We also saw this with some cloud providers: they provide a single vendor-optimized MPI library and that's it.

So now let's talk about ABI compatibility in MPI and why it is a problem. MPI, as I said earlier, is a norm, and the norm actually defines a single API, which is great, but it has several ABIs.
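To make that concrete before going into the details, here is, roughly, how the same communicator type is declared by the two most widespread ABIs. This is a simplified illustration, not the actual headers; the exact definitions vary between versions.

```c
/* Illustration only, simplified from the implementations' mpi.h headers;
 * exact definitions vary between versions.  Pick one branch to compile. */

#ifdef USE_MPICH_STYLE_ABI
/* MPICH and MPICH-derived libraries: handles are plain integers,
 * and predefined handles are fixed integer constants. */
typedef int MPI_Comm;
#define MPI_COMM_WORLD ((MPI_Comm)0x44000000)
#else
/* Open MPI: handles are pointers to implementation-internal structures,
 * and predefined handles are addresses of global objects. */
typedef struct ompi_communicator_t *MPI_Comm;
#endif
```

The same source code compiles against either declaration, but the resulting binaries store and pass completely different things for an MPI_Comm, which is the incompatibility detailed next.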
At the moment there are at least Open MPI, MPICH and all the MPICH-based or MPICH-compatible implementations, and MPC. And if you go back in time, others existed as well. The problem is that they are in general ABI-incompatible, because even the simplest element of an MPI library is not implemented in the same way. If you look at just a communicator, which is the very basis of MPI, within MPICH it's an integer and within Open MPI it's a pointer to a struct, as you just saw. So you have no way of going from one to the other, and it means that if you want to switch between those libraries, you need to recompile. Sometimes that's possible, sometimes it's not, as I said, because you can have a very complex software stack and it can take literally days to recompile everything, and sometimes you're stuck with proprietary software. Even though we are at FOSDEM, such software exists and we have to live with it on our clusters.

So how to deal with that? That was the motivation for creating Wi4MPI, which is a library that allows us to switch between MPI libraries. The idea is to catch the calls to a library and to translate them. We take the input arguments and translate them from the original ABI to the destination ABI, we call the function from the destination ABI, and then we go the other way around: we translate the output arguments and the return value from the destination ABI back to the original ABI. There is just one catch: some MPI functions internally call other MPI functions, so you have to make sure you are not already inside an MPI function, to avoid translating twice; if you do that, it will crash. We have an assembly-level call selector to deal with this issue. And as you may also know, there are literally hundreds of MPI functions, so the functions doing the translation are generated: we have templates and files defining the different functions, their input arguments and so on, and we generate everything, which makes it a bit more robust.
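To give an idea of what this translation amounts to, here is a self-contained toy sketch. It is not Wi4MPI source code: all names are invented for the illustration, the two ABIs are stand-ins for integer-handle and pointer-handle implementations, and the destination function is a local stub where the real wrapper would call the symbol it resolved with dlsym() in the target MPI library.

```c
/* Toy sketch of handle translation between two incompatible ABIs (NOT
 * Wi4MPI source).  ABI "A" mimics integer handles (MPICH-style), ABI "B"
 * mimics pointer handles (Open MPI-style). */
#include <stdio.h>

/* --- ABI "A": the one the application was compiled against --- */
typedef int A_Comm;
#define A_COMM_WORLD ((A_Comm)0x44000000)

/* --- ABI "B": the implementation we actually want to run on --- */
typedef struct B_communicator { const char *name; int size; } *B_Comm;
static struct B_communicator B_world = { "B_COMM_WORLD", 1 };

/* Stand-in for the destination library's real MPI_Comm_size, which the
 * real wrapper would have looked up with dlsym(). */
static int B_Comm_size(B_Comm comm, int *size)
{
    *size = comm->size;
    return 0;
}

/* Input-argument translation: ABI "A" handle -> ABI "B" handle.
 * (Only the world communicator is handled in this toy.) */
static B_Comm comm_a2b(A_Comm comm)
{
    return (comm == A_COMM_WORLD) ? &B_world : NULL;
}

/* The interposed symbol the "A"-compiled application calls: translate the
 * inputs, call the destination ABI, pass the outputs back.  Return codes
 * and output handles would be translated the same way. */
int MPI_Comm_size(A_Comm comm, int *size)
{
    return B_Comm_size(comm_a2b(comm), size);
}

int main(void)
{
    int size = -1;
    /* The application itself is unchanged: it still passes ABI "A" handles. */
    MPI_Comm_size(A_COMM_WORLD, &size);
    printf("communicator size seen by the application: %d\n", size);
    return 0;
}
```

The real generated wrappers do the same thing for every MPI function, plus the re-entrancy guard mentioned above, so that MPI calls made internally by the destination library are not translated a second time.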
So now, how do you use Wi4MPI? One of the great things about Wi4MPI is that it has two modes. The first one is the preload mode, which is quite interesting if you already have software that is compiled: you can translate between MPI implementations dynamically, at runtime. Imagine you have software that was compiled with MPICH; at runtime, you can use Open MPI. And there is the interface mode, where Wi4MPI is used as a stub implementation: you compile the code against Wi4MPI, and it is at runtime that you choose which MPI you want to use. If you know that Wi4MPI will be at hand on the cluster you will be using, this mode gives you even greater portability.

From an installation point of view, it's really simple: it's a basic CMake-based installation, and Wi4MPI is also available through the Spack package manager, so if you're using Spack, just run "spack install wi4mpi" and you're done. In practice, there are at least two ways of using it. The first one is to use it directly as a wrapper: imagine you're using Slurm, so you call srun, and where you would normally put just the binary you want to launch, you add the wrapper, telling it that you want to go from one implementation to the other. Another way is to use it transparently through environment variables.

Here the main catch is to have the right LD_PRELOAD, because you inject the Wi4MPI library at runtime to catch the MPI calls and translate them. Once that is set, you can run your app directly with srun. Don't worry about all those variables; they are, of course, documented. And if the translation works, the only thing that differs from a normal execution is a message telling you that you're using Wi4MPI in preload mode, from this implementation to that one.

For more advanced usage, I invite you to read the documentation; it's available online and we have a bunch of tutorials. At the moment there are seven main tutorials: a few basic ones about installing Wi4MPI and translating MPI dynamically using either the preload or the interface mode, a few examples with Python and with GROMACS, which is a molecular dynamics code heavily used in the HPC community, and examples with containers. The container tutorials currently use Podman, only because it's the easiest to install almost anywhere; you don't need root privileges to use it. But even though they are written for Podman, the idea stays the same, so if you prefer another container runtime, you can apply what we did with Podman to your specific use case.

Wi4MPI started in 2016 at CEA and is still in active development. It is, of course, open source, under a dual license, CeCILL-B plus BSD-3, to be compliant with both French and international law. All our developments are validated by a CI that uses well-established benchmarks, especially MPI benchmarks like the OSU micro-benchmarks and IOR, and more user-oriented benchmarks such as GROMACS. It is also an ongoing collaboration between CEA and Lawrence Livermore National Laboratory in the US, started in 2020; we have publications together, last year we gave a tutorial at ISC, and we hope to host another one at ISC this year in May.

Regarding the support and limitations of the library: we support x86 and ARM architectures, GNU/Linux and BSD (it was tested recently on FreeBSD), and of course C and Fortran, with the MPI 3.1 norm. In terms of limitations, for it to work you need dynamic linking: if you compiled your code statically, it won't work. It's also better if you can avoid or circumvent RPATH, because the whole idea is to inject the library at runtime. We don't support the timeout feature on BSD distributions; we added a few features to Wi4MPI that are not defined in the MPI norm, and in particular this timeout feature, which lets you set a timeout on a specific MPI call, which is actually very useful for debugging. For the translation of some constants that define the maximum length of some strings, you may get truncation, but in all the tests we did so far it was not an issue per se. And lastly, there are the MPIX functions, the experimental functions implemented in the different MPI implementations. We deal with those on a case-by-case basis, so they are not all supported; we have started with a few of them.
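To show what such an MPIX extension looks like in practice, here is a hedged sketch of how an application can probe the GPU (CUDA) awareness of the MPI library; the guard macros are the ones I would expect from Open MPI's mpi-ext.h, so check your own implementation's headers, as the extension is exposed differently across libraries. This is also the call at the heart of the GROMACS example that follows.

```c
/* Hedged sketch: probing CUDA awareness of the linked MPI library at
 * run time.  MPIX_Query_cuda_support() is an extension, so it is guarded;
 * the macros below are those shipped by Open MPI in mpi-ext.h, other
 * implementations expose the extension differently. */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>            /* defines MPIX_CUDA_AWARE_SUPPORT */
#endif

int main(int argc, char **argv)
{
    int cuda_aware = 0;

    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* Run-time answer from the library actually linked in. */
    cuda_aware = MPIX_Query_cuda_support();
#endif

    printf("CUDA-aware MPI: %s\n", cuda_aware ? "yes" : "no");

    MPI_Finalize();
    return 0;
}
```

How the query behaves when no GPU is present is left to the implementation, which is exactly what made the following case interesting.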
So now, to give you a glimpse of what you can do with it, the first example I want to show you is something that actually happened to us. We had a GROMACS version (GROMACS, once again, is a molecular dynamics code used in HPC) that was compiled against MVAPICH. And on the target cluster we were using, that MVAPICH build could only run on GPU nodes; on CPU nodes we got an error. This comes from the fact that GROMACS calls MPIX_Query_cuda_support, which just checks whether the library has GPU support. But the way it is implemented in that MVAPICH build, if you don't have GPUs, it crashes. Other implementations don't do that: they just tell you, no, sorry, no GPU support, and then you can do whatever you want. But the way it was done there, it crashes, so we couldn't run this code on CPU nodes. So we used Wi4MPI to run the code, going from MVAPICH to Open MPI, and here you see the results: with MVAPICH it fails, with Wi4MPI we get some performance, and when we recompiled the code directly against Open MPI, we got performance that is really similar.

The other use case I wanted to show you is with containers. The idea is to have an MPICH-based container in which the OSU micro-benchmarks were compiled, and we did some comparisons between two AMD Milan nodes at TGCC, which is one of our clusters at CEA. In everything I'll show you, I compare an execution of the OSU micro-benchmarks using Open MPI directly on the cluster, an execution using a container in which Open MPI is available and which we plug onto the Open MPI of the cluster, and lastly a container with MPICH in which we use Open MPI through Wi4MPI. The first graph shows the init time for those three cases, and you see that it's very comparable: we have the same results, it takes the same time to init MPI with Wi4MPI. Here it's bidirectional bandwidth, and it's the same, the results are very comparable. Another example is an allreduce, where all the cores of the two nodes participate in the communication, and we get very comparable latencies between the three cases. So the good point is that the overhead of Wi4MPI in those tests is really minimal.

Now, in conclusion and for the future: the good news for HPC, and the bad news for us, is that a standardization effort for the ABI started last year. That's really great, because it will greatly help all HPC users. There will very likely be a common ABI defined in one of the next versions of the norm; you can refer to the Hammond et al. paper from 2023. And we can reasonably hope for convergence, because nowadays two ABIs, MPICH and Open MPI, cover more than 90% of HPC platforms. The plan is to have a single-feature, ABI-only release for MPI 4.2; at some point they were talking about MPI 5, but in the end it should be 4.2, and they are hoping for a draft around SC'24, so more or less one year from now. There is already a prototype available in MPICH, and many ideas around this common ABI are implemented in Mukautuva, from Jeff Hammond; I put in the links if you want to have a look. If you want more information, you can also check the MPI ABI working group on GitHub. And the good thing about this standardization effort is that Wi4MPI is actually cited as a reference implementation.

So, to conclude on Wi4MPI in a nutshell: this library helps you switch between different MPI libraries, so it allows greater portability and better flexibility for HPC applications, including containerized applications and proprietary software. Its usage is mostly transparent.
In most of the cases we have studied so far there was no significant overhead, which is also a good thing. The library is still evolving: in the years to come we hope to have MPI 4 support, and we would like to support the Mukautuva ABI, the Jeff Hammond project that should be close to the common ABI defined in the future norm. And of course the project is open, so we are waiting for your contributions. To finish, I also want to thank all the people who contributed to Wi4MPI, especially at CEA, at Lawrence Livermore National Laboratory, and at Eolen, a company we are working with. Thank you all.

Okay, perfect timing. Questions?

Yes, thank you for the presentation, I have several questions. The first one is about Fortran: which version do you support?

The idea is to support the whole Fortran API of MPI, so we have no limitation in terms of which Fortran is supported. But it's not the most tested part, so if you have Fortran use cases, you're welcome to try it. If it works, great; if you have any issues, open an issue on GitHub, we try to be very responsive to GitHub issues.

Okay, thank you. The next question is about the ILP64 ABI, because a lot of programs are compiled in ILP64 mode, for example with -fdefault-integer-8. Do you support this?

Once again, the idea is that if it works with your MPI implementation, if you can actually do it with a compatible MPI implementation, it should work. But in some cases you have to build the library and the application in the same way for the ABIs to match.

Maybe you didn't quite get the problem: a lot of implementations already have two wrappers, so first you go through a wrapper which translates from ILP64 to LP64, and only after that do the traditional calls happen. So do you support this?

I'm not entirely sure; we should discuss it afterwards.

Yeah, and the last question is about the runtime, because for example there are additional parameters for mpirun, and some programs like ORCA or MRCC just do internal calls to the runtime, I mean they just exec mpirun. Do you do something with this? Because, for example, Intel MPI supports some additional flags and Open MPI also supports some additional flags, and they differ between the two implementations.

Same thing, I'm not entirely sure, because for quite some time we actually didn't support mpirun for this kind of question, and also because on our clusters the norm is to use srun directly, so there were a few things in this regard that we didn't support. We recently started to support mpirun, but I'm not entirely sure of the level of support we have for things that are specific to one implementation versus another.

Thank you.

More questions? Hi, cheers for that. I wanted to ask: you mentioned some example programs and benchmarks that get performance portability here, but what HPC applications are you targeting with this that require that performance portability? Because if it's a GROMACS application that wants to use all of the system, typically that code will run for years on the system and be optimized for it anyway. So at which point do you need something that works on multiple systems, and at which point do you need something that's very specialized, so that you wouldn't actually need a wrapper in between?
I'm not sure I understand your question.

What applications use this in the real world on the clusters that you're working on? Not example programs, but what actual programs require the translation from one MPI to another?

The only cases in which we really had to use the wrapper, with no other choice, were commercial software. In particular, when I talked about bleeding-edge systems: we have some systems with the BXI interconnect, which is the Eviden interconnect, and they supported only Open MPI. The thing is, some commercial vendors distribute their software built against a specific MPI implementation, and they have no interest in supporting another one. So if we wanted to actually use those commercial applications on our systems, we had no other choice than using Wi4MPI. That is the only case in which it is really mandatory. In most other cases it is more a matter of comfort, in some sense: it helps you debug quickly, it helps you test things more easily. If you had an infinite amount of time in front of you, you could do anything, you could always recompile everything; but we don't have an infinite amount of time in front of us.

OK, thank you. Any more questions? I have a quick one: have you looked at MPItrampoline? That's a very similar project to this.

Yes, MPItrampoline appeared a few years ago. The main difference between MPItrampoline and Wi4MPI is that MPItrampoline only offers what we call the interface mode in Wi4MPI: you can compile against MPItrampoline and then use another MPI implementation, but you can't bring an already-built code and run it directly with MPItrampoline. But yes, in the past few years there have been a few ideas similar to what we did with Wi4MPI, and it's interesting, because for quite some time people didn't really care about this ABI incompatibility issue, and we see now that there is some interest. That's also why, at some point, the MPI Forum decided it was time to have a common ABI. So it's great to see other projects like that emerging.

All right, I think that's it. Thank you very much. And if you like the project, there are some stickers here in front as well.