Okay, this is going to be interesting. We are relying on the Wi-Fi a bit here as well, so it would actually help if you turned off your Wi-Fi. I know that's a big ask; consider it for the next half an hour. That would be really helpful. So Vanessa is live here through a video call. Give us a wave, Vanessa. Can you try speaking? "What's up, folks?" Sorry, that's not working. Is it working now? Try again? "Still, what's up, everyone?" Okay, that's much better. Nice. So we'll start your recording, Vanessa, and then we'll try to do live Q&A at the end. Sounds good. I have some answers for the previous Q&A, too, so we can talk a little bit about that. We can try. We can try. By the way, Vanessa is also the one who designed the HPC social logo, so you should thank her for that and take some stickers when you leave. Thank you. Thank you. All right, here comes the talk.

Hi, folks. I'm Vanessa Sochat, and today we're going to be talking about Kubernetes and HPC, the bare metal bros. I thought I would open this talk by putting two words on the slide, and then I'll get to the question that many of you have probably been asking, maybe a little anxiously. Those words are cloud and HPC. The question on everyone's mind is: what does the future look like? I'm going to answer that by posing a question back to you: where is the money going? We can look at reports from Gartner and Hyperion Research suggesting that cloud is projected to reach about $40 billion by 2026, while traditional HPC grows at a smaller CAGR of 6.4 percent. So, very superficially speaking, the money is going to cloud.

Now we can follow up on this question: okay, that's great, but who's going to get left behind? We can look at a paper from Reed, Gannon, and Dongarra from 2023 that identified some really interesting trends. For HPC, it suggested that the way we design and deploy our systems is not going to continue to work. We can no longer depend on Dennard scaling and Moore's law, and there are rising costs for improved semiconductors. This is going to make it harder, more expensive, and more laborious to deploy new systems. They describe non-recurring engineering (NRE) costs that we incur for every new system. Cloud, on the other hand, is leading the space in innovation. As we know, there's a massive expansion of large-scale commercial clouds. They are not depending on software vendors or hardware vendors; they're making their own stuff in-house. And guess what? They're hiring away and attracting the talent pool.

They made a really interesting analogy with temperature. They described HPC as endothermic, requiring the absorption of heat for survival, and cloud as exothermic, giving off heat. And we know, folks, we're not really talking about heat here; we are talking about money. But to continue the heat analogy: if you've ever been out in the snow, in a cold environment, you know you're much better off being the one that can give off heat to survive. So who gets left behind? The one that needs to constantly absorb heat, because that heat is probably going to run out. And that's the reason we're all here: we need to ensure that the needs of our science are represented in this new environment. And guess what? The success of our science really depends on our ability to be collaborative in this space.
And so this is really the manifesto of Converged Computing: if we bring them together, we get a new technology space with the best of both worlds. So where do we start? Here is how the talk is going to proceed today. We're going to start with models for convergence, talking about patterns for bringing together traditionally disparate environments. We're then going to move into strategies for convergence: designs that I've noticed allow for easy movement between the spaces.

Let's start with those models for convergence. If you've looked in paper land, you've probably seen many different models; there are many different ways to take HPC and cloud and put them together. I'm going to talk about the high-level patterns from the perspective of someone who is maybe deploying a system. Let's say that's me, and let's say I want cloud and HPC. I'm going to take my limited set of resources and try to split them in two. So I spend a ton of money, I do this, and then... I chose poorly. No one's using half my resources. Oh my god. So four years later I come back and I'm like, all right, I want cloud XOR HPC, exclusive or. I understand I can't have my cake and eat it too, so I'm just going to choose one. We've used HPC all these years, it's our bread and butter, it's the way we've always done things: I choose HPC. Great. Six months later, someone comes into my office: are we dinosaurs? You know, everyone over there is using YAML and automation, and we have this old setup, and... ah. So you go back to your office, you contemplate your life choices, and you're like, all right, no, it's okay, I'm not going to wait another four years, I'm going to sneak it in. This is where you see all of these ideas like bursting and multi-cluster, which generally refer to having some home base of resources and reaching out to get more. The problem with this approach, as I see it, is that the complexity of these approaches often reflects the complexity of the systems. They tend to be snowflakes, they tend to be complex, and this is why there hasn't been a single leader that has emerged in the space.

So here is a different idea that's less common, because it doesn't superficially make sense: I want cloud AND HPC, meaning I want to be able to run HPC, or cloud, or both at the same time, or something more converged. What the heck am I talking about? Don't worry, we'll talk about it. Let's first talk about strategies for convergence. These strategies, I need to point out, are not just about the technology; they are also about the people, which is often harder. The first is common goals. In order to get two different communities working together, they have to care about the same things; you can't get around that. The second is modularity: the degree to which your application or infrastructure is modular means you can use pieces interchangeably, swap them, and be very creative. The third is integration: the consumption of one entire thing inside another thing, by way of different strategies.

So let me give you some examples. For goals, the best overlap of goals I've seen is with respect to batch workloads. A few years ago, the Kubernetes community started the batch working group, and this was because of the new need to run AI and ML workloads in Kubernetes. Traditionally, Kubernetes is where you run services; you keep something running. There wasn't this concept of starting something and having it complete, but all of a sudden there was this new need, and guess what? We have been doing that in HPC land for a couple of decades now.
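To make that batch idea concrete, here is a minimal sketch, not from the talk, of what a run-to-completion batch Job looks like through the official Kubernetes Python client. The job name, namespace, image, and command are placeholder assumptions.

```python
# Minimal sketch (not from the talk): a run-to-completion batch Job submitted
# with the official Kubernetes Python client. Name, image, and command are
# hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="demo-batch-job"),
    spec=client.V1JobSpec(
        completions=1,      # run once to completion, like an HPC batch job
        backoff_limit=0,    # do not retry on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="app",
                        image="busybox:latest",
                        command=["sh", "-c", "echo running && sleep 10 && echo done"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```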
Modularity: a really great example is actually Kubernetes and Flux Framework. You may think of Flux as just this workload manager, but it's actually called a framework because we assemble many different components together into the workload manager known as Flux. Kubernetes is the same, a different set of components, and there are creative ways that we can use these pieces interchangeably. For the final example, integration, the best technologies I can point to are containers and language bindings. Container technologies are literally a vehicle to let you move between spaces, and language bindings let you take a traditionally C++ HPC project and extend it into a language that is native to cloud, for example, Go.

Alrighty, let's get into some examples, just like eggs three ways. Here are some projects that we've actually been working on at the lab. The first is Fluence. As I alluded to, this is the Flux scheduler swapped in for the default Kubernetes scheduler. The next is the Flux Operator, the entirety of Flux Framework implemented inside of Kubernetes. And then the namesake of this talk, the bare metal bros: Flux and Kubernetes working side by side.

So let's start with the Flux scheduler within Kubernetes. You may be familiar with what happens in Kubernetes when you launch a job: you ask for a certain number of resources, that's given to the scheduler, and the scheduler says, okay, here are four pods, have a nice day. What we're going to do is bring in Fluence: our C++ package, flux-sched, wrapped with Go bindings into a custom scheduler plugin. We're going to swap it in. You're basically going to ask for the same amount of resources, but the scheduling is going to be done by flux-sched. How does this do? Well, we find that the workflows run three times faster. What you're seeing here is kube-scheduler on the top, Fluence on the bottom. You see a lot of randomness with respect to how kube-scheduler places jobs, and what this leads to is a pathological scheduling pattern. Anywhere you see a red box, that is a startup delay. What that means in practice is that although the workloads themselves run in similar times, we have a lot of outliers, a lot of jobs that take a really long time to get started. Fluence improves upon that. So Fluence is a really great example of modularity, because we're taking an HPC technology and literally swapping it in, and the modularity of the software allows for that. It's also a great example of integration: because we have those Go bindings, we can speak the language of the cloud native communities.
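From the user's side, swapping schedulers like this mostly amounts to pointing a pod at the alternate scheduler by name. Here is a rough, hedged sketch with the Kubernetes Python client; the scheduler name "fluence" and the label are assumptions for illustration, not exact Fluence configuration, and the plugin would need to be deployed on the cluster first.

```python
# Rough sketch: from the user's point of view, a custom scheduler plugin is
# selected by setting scheduler_name on the pod spec. The scheduler name and
# label below are illustrative assumptions, not exact Fluence configuration.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(
        name="custom-scheduled-pod",
        labels={"app": "demo"},  # hypothetical grouping label
    ),
    spec=client.V1PodSpec(
        scheduler_name="fluence",  # hand placement decisions to the custom scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="app",
                image="busybox:latest",
                command=["sh", "-c", "echo scheduled && sleep 5"],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```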
Alrighty, next project: the Flux Operator. Super cool. All the gophers in Flux-land are pretty cool. The Flux Operator implements the entirety of Flux Framework inside of Kubernetes: your own HPC cluster. This happens by way of a custom resource definition, or CRD, where you basically give all the parameters that you want for your cluster, whether that's a single job or an interactive cluster. This creates what we call the MiniCluster, and Flux doesn't know the difference between running in Kubernetes and running on bare metal. There's a lead broker connected to several follower brokers, so here you have one pod for one physical node, connected by the tree-based overlay network, and within each pod, or node, you have Flux added on the fly to your application. The Operator is just going to reconcile until the state that you asked for matches the actual state of the cluster. How well does it do? We compared it to the best in the space last year, the MPI Operator, and the Flux Operator consistently outperformed it, we believe because of the ZeroMQ bootstrap. So the Flux Operator is a beautiful example of integration, because we're taking the entirety of Flux Framework and implementing it inside of Kubernetes.
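Here is a rough sketch of what creating a MiniCluster can look like from Python through the Kubernetes custom objects API. The group and kind are the Flux Operator's, but treat the API version, namespace, image, and spec fields below as approximations; check the Flux Operator documentation for the authoritative schema.

```python
# Rough, hedged sketch: creating a Flux Operator MiniCluster via the Kubernetes
# custom objects API. The apiVersion, namespace, image, and spec fields are
# approximations; consult the Flux Operator docs for the real schema.
from kubernetes import client, config

config.load_kube_config()

minicluster = {
    "apiVersion": "flux-framework.org/v1alpha2",  # version may differ per release
    "kind": "MiniCluster",
    "metadata": {"name": "lammps-demo", "namespace": "flux-operator"},
    "spec": {
        "size": 4,  # one pod per "node": a lead broker plus follower brokers
        "containers": [
            {
                "image": "ghcr.io/example/lammps:latest",  # placeholder image
                "command": "lmp -in in.lammps",            # placeholder command
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="flux-framework.org",
    version="v1alpha2",
    namespace="flux-operator",
    plural="miniclusters",
    body=minicluster,
)
```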
Bro, bro, bro, is it time for the bare metal bros? Yeah! Okay, so, warning: I've been saying bare metal, but nobody's going to give me bare metal, let's be frank about that. We're using virtual machines as a proxy for bare metal, so just a warning. So what's different about this picture? The orange is on the outside. We actually have Flux Framework on the outside spinning up a Kubernetes cluster, and notice that we still have compute running on "bare metal" alongside Kubernetes. How is that possible? Don't worry, I'll tell you. So why do we need this in the first place? As you know, there are increasingly more complex, heterogeneous workloads coming to HPC. That means not just embarrassingly parallel stuff, but also adding in services, databases, task queues. Ah! Okay, so this slide is not wrong. I was going to give you an example of such a workload, and apparently this slide is warning you that I'm a bad scientist. I'm not, and I will point out that my example is actually a very good prototype for this kind of design. Let's talk about that. So let's say that we're running simulations, training examples one through N, whatever, it doesn't matter, and we want to send them to a machine learning server, a specific endpoint, to do the training. We then want to wait until some metric of goodness, or perhaps a number of samples, and then we want to flip it around: we run simulations again, but we give the data to our machine learning server without the actual values, and then we have a vector of true values and predictions and we can see how well we did. Now, very superficially, if we map this to HPC versus Kubernetes, we would expect that the simulations run better on bare metal, and the service side runs better in Usernetes, or Kubernetes. But that's just what we'd expect; we need to prove it to ourselves first.

So a lot of you are probably out there like, user... what? Kubernetes in user space? Are you nuts? I'm not nuts. There's actually something called Usernetes. It came out of a Kubernetes Enhancement Proposal, or KEP, in 2022, from a very talented developer named Akihiro Suda. Akihiro, I must point out, won the Top Maintainer award at KubeCon last year. He's an incredibly talented developer; if you've used any of these rootless technologies, he's the one behind them. Hats off to Akihiro. At the beginning of last year, Usernetes was really a hodgepodge of bash scripts; it was really hard to use. So I engaged with Akihiro, and we released Generation 2 of Usernetes in September. And guess what? It is using containerization, which is really great, and it has these components that we'll go into in more detail. So what does it mean in practice? Well, it means that when you're building a virtual machine, you need to have cgroups version 2 enabled. I recommend Lima (Linux virtual machines) if you're prototyping this for the first time. It also means that you need to enable these kernel modules. Very generally speaking, br_netfilter is going to allow you to apply iptables rules to bridged traffic, and vxlan is going to allow you to connect VXLAN devices on different hosts to a standalone bridge. This is important because we actually have different physical nodes. Now, it's going to use rootless Docker. This isn't such a crazy idea anymore; many clusters have Podman these days. And what does it mean when you bring up these VMs? You're going to run a 'make up' command that has two contexts. Both of them build and start a base image that is using kind, Kubernetes in Docker, with CNI plugins. The two contexts are the control plane and the worker. The control plane installs Flannel and runs kubeadm init. This generates a join command, which is basically a token that you give to the workers, and then the workers can authenticate and join the cluster. And that's what they do; they're just like, I'm ready to serve.

All right, so we created this cluster, small and mighty, using oVirt and Ansible. It is small and mighty because each node has eight cores and about 30 GB of RAM. And I want to point out that we have seven nodes here because, generally speaking, we're going to have six that we run compute on and one that's an admin node, or control plane. Again, warning: not bare metal, you get the deal. All right, so what's in these VMs when we bring them up? We have a complete system install of Flux, Singularity on bare metal for reasons I'll tell you about in a little bit, LAMMPS installed on bare metal, and of course Usernetes ready to be brought up. So once I shell into these VMs, my Flux cluster is ready to go; I can do flux resource list and see all my nodes. And for Usernetes, again, that administrative node is also the control plane, so we technically have six nodes to work with, and we can still see them with kubectl get nodes. Here's what we're working with: Usernetes and Flux running side by side, the bare metal bros.

All right, bro, bro, what experiments do we want to run? All of them, bro! All right. So we first need to sanity check that what I said earlier about bare metal and LAMMPS and the simulations is actually true. We need to look at application performance between Flux and Usernetes. The way we're going to do that is by running a few things. We're first going to run LAMMPS on bare metal with Flux. We're then going to do the same thing but in a Singularity container, and I did this just to demonstrate that you don't lose anything by using containers, which is great. We're then going to run LAMMPS in Usernetes with the Flux Operator. And finally, we're going to repeat cases one and two, but with Usernetes running in the background, to see if there's any overhead from that.
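One practical aside before the results: the prerequisites listed above (cgroups v2, br_netfilter, vxlan, rootless Docker) are easy to get wrong, so here is a small, hedged preflight sketch you could run on each VM before 'make up'. The module names come from the talk; the check logic itself is an illustrative assumption, so consult the Usernetes documentation for the full requirements.

```python
# Hedged preflight sketch: verify cgroups v2, the kernel modules mentioned in
# the talk (br_netfilter, vxlan), and rootless Docker before `make up`.
# Illustrative only; the Usernetes docs are the source of truth.
import pathlib
import subprocess


def has_cgroup_v2() -> bool:
    # On a pure cgroup v2 host, cgroup.controllers exists at the cgroup root.
    return pathlib.Path("/sys/fs/cgroup/cgroup.controllers").exists()


def module_loaded(name: str) -> bool:
    # /proc/modules lists currently loaded kernel modules, one per line.
    loaded = pathlib.Path("/proc/modules").read_text()
    return any(line.split()[0] == name for line in loaded.splitlines())


def rootless_docker_ok() -> bool:
    # `docker info` should succeed for the unprivileged user.
    return subprocess.run(["docker", "info"], capture_output=True).returncode == 0


if __name__ == "__main__":
    checks = {
        "cgroup v2": has_cgroup_v2(),
        "br_netfilter module": module_loaded("br_netfilter"),
        "vxlan module": module_loaded("vxlan"),
        "rootless docker": rootless_docker_ok(),
    }
    for name, ok in checks.items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```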
And I need to pause for a second, because I know how incredibly cool this third case is. We have Flux on the outside; Flux is running Usernetes; within that we are launching the Flux Operator, which is bringing up another instance of Flux; and inside there is where LAMMPS is running. So folks, I know Thanksgiving is over, but this is the ultimate turducken. And we expect LAMMPS to be slower in Usernetes because, as we know, it makes MPI collective calls, and Usernetes is using something called slirp4netns, which requires additional processing of packets with a tap device. I have a great paper I can share if you're interested in learning more about that. So, drumroll, the results. As we expected — well, actually, maybe we didn't expect this — the Singularity container case is very comparable to actual bare metal. I was very surprised by this. Having Usernetes running in the background also does not add a lot of overhead. And, as we did expect, the case up there running in Usernetes is about twice as slow as running on bare metal. So what did we learn? We learned that for a setup like this, the network-sensitive stuff probably should be run on the HPC side. But I'll point out there's opportunity for improving this in Usernetes: if you have experience with networking, I'd love for you to go over to the GitHub right now — I'll just wait, pause the talk — and engage there to help work on this problem.

Now, the next thing we want to look at is distributed machine learning, specifically two cases: one distributed across six nodes, and the second on one node. In the distributed case the network is a variable; for one node, obviously, it is not. Drumroll, results: same thing, it's about twice as fast on bare metal, or twice as slow, I guess, in Usernetes. And interestingly, when you look at just a single node, these are really comparable. So there's no issue with running something on a single node in Usernetes in and of itself; it's really when you bring in the networking that it becomes a variable. So it's the network, right? Well, let's sanity check one more thing. Here's iperf. We did a bit of transfer from each node as a client to each node as a server. The bitrate in gigabits per second is between 10 and 30 for bare metal; for Usernetes it's barely detectable, close to zero, really, really terrible. We can see the same patterns for the amount transferred. So yes, it's the network; we're pretty confident that for this setup, it's the network.

All right, can we do the fun workflow now? We absolutely can. Guess what, I actually prototyped this kind of workflow because I was really excited about it. What we're going to do is launch a batch job with flux batch. This means a Flux instance that's owned by the running user; it's going to scope resources using hwloc, and within this batch job we can basically bring up and tear down all of Usernetes. We're going to take that workflow I mentioned before and map it onto our little cluster. We're going to run simulations with LAMMPS, randomly selecting the problem sizes, to predict wall time. We're then going to bring up a machine learning server, a special server I made using River a few years ago. And then we're going to do the test cases: we run LAMMPS again, but we leave out the actual wall time and ask our models what it is. We're going to do a thousand training samples and 250 testing samples.
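Since the workflow logic is the interesting part, here is a stripped-down, hedged sketch of the train-then-predict loop using the river online learning library. The run_lammps function is a stand-in for the real simulations, and the model is in-process rather than behind the machine learning server the talk describes; the sample counts match the talk, everything else is illustrative.

```python
# Stripped-down, hedged sketch of the workflow: learn to predict "wall time"
# from randomly chosen problem sizes, then hold out the wall time and compare
# predictions to the truth. run_lammps is a placeholder for the real
# simulation; the real setup talks to an ML server instead of an in-process model.
import random
import time

from river import linear_model, preprocessing


def run_lammps(x: int, y: int, z: int) -> float:
    """Placeholder: pretend to run a simulation and return wall time in seconds."""
    start = time.time()
    _ = sum(i * i for i in range(x * y * z * 200))  # stand-in for real work
    return time.time() - start


# Online pipeline: scale features, then fit a linear regression incrementally.
model = preprocessing.StandardScaler() | linear_model.LinearRegression()

# 1,000 training samples: run, measure wall time, learn one sample at a time.
for _ in range(1000):
    size = {"x": random.randint(1, 8), "y": random.randint(1, 8), "z": random.randint(1, 8)}
    model.learn_one(size, run_lammps(**size))

# 250 test samples: predict wall time without showing the model the truth.
errors = []
for _ in range(250):
    size = {"x": random.randint(1, 8), "y": random.randint(1, 8), "z": random.randint(1, 8)}
    errors.append(abs(model.predict_one(size) - run_lammps(**size)))

print(f"mean absolute error: {sum(errors) / len(errors):.4f} seconds")
```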
How did we do? I put no thought into these particular models, but I did three kinds of regression. The Bayesian one, sampling from a probability distribution, didn't do super well, but for the first two there's an actual pattern between the predicted and the actual time. So although I put no thought into this, I was really pleased with the result: the general prototype, this idea of having bare metal simulations running alongside a service, there is something here. We can do science this way, with actual, real scientific questions. And I'll point out that there are real heterogeneous workloads out in the wild that need this capability. Here's MuMMI, the Massively Parallel Multiscale Machine-Learned Modeling Infrastructure, which simulates biological systems, the interaction between proteins and the plasma membrane. I'll also point out that the name is based on the Moomins, the Finnish comic book series with really cute hippo-like characters, often with yellow spiky hair. Very awesome. So this is the perfect example of the bare metal bros, of coexistence: adopting technologies to make it possible to coexist, and continuing to improve upon them so that, for example with networking, this environment can get even better.

So what should you remember from this talk, if you take nothing else away? The first is to look out for opportunities for collaboration: look for that alignment of goals between spaces, that's an opportunity. The second is providing handles for your components: if you don't have the bandwidth to look for opportunities, add some Go bindings to your C++ project, because someone else could find you. The third is engagement: we need to show up at the table, go to working groups, conferences, places we haven't traditionally been, to engage and find these opportunities for collaboration. And possibly the most important is mindset. We've had this mindset of cloud versus HPC, that one has to win, for so long. We need to throw that away, get rid of the adversarial thinking, and have a more collaborative mindset. This is the vision that we have for the future of converged computing, and we hope that you'd like to join us. So thank you. That's how to reach me, my email and social networks, and here are some interesting links for Flux and the various projects. I think I will take some questions virtually now.

Okay, we can take a couple of questions; it seems like the Wi-Fi is stable enough to let Vanessa answer them. Do we have any questions? Okay, so Vanessa, we may have to repeat a question for you; we'll see how that works. Hi Vanessa, amazing talk, congrats. So I was wondering if your architecture can support sidecars, because one of the nightmares I had when I was trying to do something similar was that in order to get the sidecars running I had to spin up a second network stack, and that created a lot of overhead. No, no, just this one is on. Okay, did you get the question, Vanessa? No, I didn't hear the question at all. Neither did I. Yeah, maybe that's better. Okay, let's do it like this: you come up front and ask it here. Yeah, that's perfect. That'd be great, I can hear you great. Hi there. Hi. So I was wondering if your architecture can support sidecar containers, because, as I was saying, when I was trying to do something similar and I tried to create the sidecars, I had to create a second network stack within Singularity, so the network overhead was amazingly high. So, absolutely: the Flux Operator actually uses a sidecar, an init container, which is similar in concept, to add Flux on the fly as a view. What's going on with the networking in Kubernetes is sort of a different thing from that, so the short answer is yes.
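For context on that answer, here is a hedged sketch of the general init-container pattern being referred to: an init container populates a shared emptyDir volume (a "view" of extra software) that the application container then mounts. This shows the pattern in spirit only; it is not the Flux Operator's actual manifest, and all names and images are placeholders.

```python
# Hedged sketch of the init-container "view" pattern: an init container writes
# into a shared emptyDir volume, and the application container mounts the same
# volume. Pattern in spirit only, not the Flux Operator's actual manifest.
from kubernetes import client, config

config.load_kube_config()

shared = client.V1VolumeMount(name="software-view", mount_path="/mnt/view")

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="view-pattern-demo"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        volumes=[
            client.V1Volume(name="software-view", empty_dir=client.V1EmptyDirVolumeSource())
        ],
        init_containers=[
            client.V1Container(
                name="populate-view",
                image="busybox:latest",
                command=["sh", "-c", "echo 'pretend software install' > /mnt/view/README"],
                volume_mounts=[shared],
            )
        ],
        containers=[
            client.V1Container(
                name="app",
                image="busybox:latest",
                command=["sh", "-c", "cat /mnt/view/README && sleep 5"],
                volume_mounts=[shared],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```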
To add to that, though, I'm not sure that Singularity as the container runtime for Kubernetes would work. I have never tried that, but it doesn't sound like it would work. Yeah, it needs to be done. Yeah, exactly.

Hi Vanessa, thank you. Hi. It was the most fun presentation so far, thank you. So, when you were saying that the main difference in performance between VM and bare metal workloads was related to the network: was that also the case for distributed training, and if so, were you using InfiniBand or not? So, we did not have InfiniBand, and you make a really good point that this kind of setup would need to be tested with an actually great network. That is still a very big challenge even for cloud. For example, if you use AWS, you can bring the Elastic Fabric Adapter, which will give you great networking performance, but if you go to other clouds, and I don't have to name them specifically, you tend to only get really good networks when it comes to using TPUs or GPUs. The exception is Azure, which has a lot of really great HPC stuff built in. So absolutely, you could get that setup with InfiniBand.

Hi, thank you for your talk, I had a smile on my face the whole time; thank you for having such high energy at the end of the day. What was I going to say? Oh yeah: in my workloads I can probably reduce the network traffic by a very large margin if I can constrain certain jobs to specific nodes, because then large files don't have to be moved across the network for those jobs. Is that something you could keep in mind? So, if you remember the very quick machine learning experiment that we showed: when you're running something on one node and you're not using the network, there's no issue. So if you're just running something on one node in Usernetes, you won't have an issue, and to the degree that you can reduce anything that uses the network — moving data, MPI, etc. — you will get similar performance, at least in this small prototype experiment, as you would on bare metal. And I have to say "bare metal" like this because it wasn't really bare metal. Thanks.

One more question. Hey Vanessa, it's Danny. I'm going to dye my hair soon, so you won't recognize me again. I really liked your framing, actually. I thought it was going to be sort of adversarial, and then I realized what you were actually saying and I really appreciated it. However, regarding the adversarial framing: I have some experience with cloud tools and cloud environments being used as platforms for vendor lock-in. I think what you described, especially with converged computing, is a way to push back, so that scientific labs aren't indebted to corporations, and I think you gave a really useful example of one way to do that in your talk. So, again, I was very impressed by how you explained that. I would like to know, in a more general sense, how can labs, and potentially RSEs, make use of cloud tools without getting locked in or becoming beholden to a corporate environment? And again, I think you effectively did that in this talk, so I'm just looking for a more general thought about it. You're totally correct that vendor lock-in is an issue: you tend to see many sort of niche APIs in the different clouds, and if you build your entire thing around them, you do face that as an issue.
But the great thing about Kubernetes is that it is this open source project that is available across clouds. There are subtle differences, but if you make a workload that can run on Kubernetes, you're going to have an easier time moving it between clouds. And, speaking for my lab, we work on Flux Framework, and one of our goals with Flux is to make things portable, not just between clouds, but between cloud and HPC. That's also why something like Usernetes, running actual Kubernetes on bare metal alongside HPC, is so important: all of a sudden you have the same workload and it runs in all the places. That is the vision. We want to make sure that the scientific workloads we're running today can run everywhere, not just in one niche specific cloud, not just in one niche specific center. Just convergence, TL;DR. That is very exciting and I really appreciate that response, thank you so much. Okay, that's all we have time for. This was great, Vanessa, I hope you agree. Yeah, it was really fun. If anyone has further questions and stuff, please reach out to me, I love chatting. It was a pleasure chatting with you, and I hope you have a great rest of the event. Thank you. And the best way to reach out to Vanessa is via HPC social, so don't forget to grab a sticker as you walk out. Please consider dropping a small donation in the box as well to help cover the costs. And if you're leaving, please check if you see any trash around and take it with you — bottles, anything. Anything you clean up, we don't have to clean up. Thanks a lot, Vanessa, this was great. Bye.