With the next talk, Eduardo is going to talk about streamlining your cluster. You continue to hate containers, no? So hey, big topic switch: big data just went out of the room, now Kubernetes. This is the first of three Kubernetes talks, and the first time Kubernetes is taking up a big part of the HPC room, so I'm happy about that. Today I'm going to be talking about something I have cared about ever since I met Kenneth: how to properly create labels and annotations so we can send workloads where they need to go and keep them away from where they don't need to be. So let's talk about that. Everything I'm going to be talking about comes down to automating how we use Kubernetes clusters.

But why this conversation? When the Kubernetes and containers conversation started in the HPC ecosystem, six or seven years ago, the promise was that once we implemented containers and Kubernetes we could kind of forget about infrastructure and just jump into a new adventure; everything was going to be abstracted away from us. It was a bit like Bilbo heading off on his adventure, but it took six movies to actually complete that adventure, and I feel we are still in The Hobbit. We haven't jumped into the Lord of the Rings part of this story yet.

Who am I? I have been doing containers for HPC since Singularity, so for a long time. Currently I am working at NVIDIA, trying to make everything around Kubernetes and GPUs easier and better for everyone: basically contributing to Kubernetes and all the upstream projects to make it easier for you to run GPU workloads in any cloud-native ecosystem.

This is the agenda for today. I want to talk about the Node Feature Discovery project and all the amazing things that you can do with it. If you scan this, it will take you to the GitHub page of Node Feature Discovery. This QR code will be on other slides as well, so don't worry if you miss it once.

So what is NFD, as I call it, or Node Feature Discovery? It's basically a Kubernetes SIG project. The SIG part means it is a special project that lives under the Kubernetes umbrella and is governed by the CNCF processes. It's an add-on for detecting hardware and features in your system. That's what I will be talking about. And the reason I frame this with The Hobbit and all the Lord of the Rings references is that I really like the scene of Smaug talking with Bilbo, where they are constantly labeling themselves: "I am the clue-finder, I am the web-cutter." They keep giving themselves labels to make themselves bigger, just to identify who is in the room, because during that conversation Bilbo is invisible to Smaug's eyes.

But what is NFD? NFD has basically four components: the NFD master, the NFD worker, the topology updater, which is a newer addition, and the garbage collector. Why this split and not just a single component? For security reasons. The NFD master is the only container running in your Kubernetes cluster that actually has the permissions and privileges to label the nodes, to create taints, annotations, and all the features that I will be explaining today. The three other components communicate constantly with the NFD master to tell it what to do. The NFD worker is the main component of NFD, and it is basically a DaemonSet. A DaemonSet in Kubernetes is a pod that runs on every node.
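To make that concrete before we go deeper: here is a minimal sketch, assuming the official `kubernetes` Python client and a reachable kubeconfig, that lists the labels the NFD master ends up publishing on each node. `feature.node.kubernetes.io/` is the default NFD label prefix.

```python
# Minimal sketch: list the labels NFD has published on each node.
# Assumes the official `kubernetes` Python client is installed and a
# kubeconfig is available; the label prefix is the NFD default.
from kubernetes import client, config

NFD_PREFIX = "feature.node.kubernetes.io/"

def print_nfd_labels():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        nfd_labels = {k: v for k, v in labels.items() if k.startswith(NFD_PREFIX)}
        print(node.metadata.name)
        for key, value in sorted(nfd_labels.items()):
            print(f"  {key}={value}")

if __name__ == "__main__":
    print_nfd_labels()
```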
And this container will go to your nodes and discover everything that is available to your container. I want to be very specific here: everything that is available to the container. If you are not bind-mounting something into the container, or if you are not allowing your container to see certain things in your system, the NFD worker is not going to see them. Everything that is visible to the container, within the privileges of the NFD worker container, is going to be advertised to the NFD master, and the NFD master is going to advertise that to you.

The topology updater is part of an ongoing effort to create topology-aware scheduling in Kubernetes, so it's a very HPC-related topic, and it is still under development. The topology updater is now about two years old, topology-aware scheduling in Kubernetes is about a year and a half old, and it's getting very stable. So Kubernetes is getting very, very close to supporting more and more deep-dive HPC workloads.

The garbage collector is for removing things that may themselves have been removed. A very good example for me, working at NVIDIA, is that some people turn on their systems, start running Kubernetes, and then for some reason remove GPUs manually. So we had to implement a component that, roughly every minute, goes and checks whether everything is actually still there and updates the labels.

This is what the API looks like. The API for Node Feature Discovery, for those that know how to define an API in Kubernetes, is a custom resource definition. Basically any component in Kubernetes, even from another namespace, if you give it permission to update, patch, or create these NodeFeature custom resources, as we see here, can act as a sidecar container for the NFD master of Node Feature Discovery. As an example, from NVIDIA we have a sidecar container that we call the GPU Feature Discovery. It's basically a container that knows about GPUs, specifically NVIDIA GPUs; it gets very specific. Intel has a similar container, and AMD is working on a similar container. So you have a container that is very specific to whatever you need, and it communicates back to the NFD master using this API.

But when I say that the NFD worker is going to see everything you have in your system, I truly mean everything. If you enable everything that NFD can expose, it becomes a list that you will never read; basically, you will spend forever just looking at your Kubernetes cluster and never get to use it. So let's trim it down.

So we have automated node labels. Why automated node labels? It got to a point where we were creating so many labels and features, and exposing so much information to users in their Kubernetes clusters, that it was not usable. So we defined rules, and we defined rules around this same theme: when a cluster is this big, it has to label itself. When Smaug goes around saying "I am Smaug the Tremendous, I am Smaug the Mighty," it's the same analogy. We have to tell the cluster: OK, these are the only labels I want you to expose about yourself.

And this is how it works: it's a NodeFeatureRule. We created an API to tell the NFD master: for all the information that you are getting from the NFD worker, only advertise the things that match these features.
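As an illustration, here is a hedged sketch of such a NodeFeatureRule that keeps just one label, built as a plain dict and printed as JSON so it can be piped into `kubectl apply -f -`. The `nfd.k8s-sigs.io/v1alpha1` group/version and the `rules`/`labels`/`matchFeatures` field names follow the upstream NFD examples; double-check them against the NFD version you actually deploy.

```python
# Hedged sketch of a NodeFeatureRule that trims NFD output down to one label.
# Field names follow the upstream NFD examples; verify against your version.
import json

rule = {
    "apiVersion": "nfd.k8s-sigs.io/v1alpha1",
    "kind": "NodeFeatureRule",
    "metadata": {"name": "avx512-only"},
    "spec": {
        "rules": [
            {
                "name": "expose AVX-512 capable nodes",
                # Bare label names get the feature.node.kubernetes.io/ prefix.
                "labels": {"cpu-avx512": "true"},
                "matchFeatures": [
                    {
                        "feature": "cpu.cpuid",
                        "matchExpressions": {"AVX512F": {"op": "Exists"}},
                    }
                ],
            }
        ]
    },
}

# Pipe into kubectl:  python make_rule.py | kubectl apply -f -
print(json.dumps(rule, indent=2))
```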
The match-feature API is very rich: we support regexes and match expressions, where we can say things like this attribute exists, is true or false, is greater than, is smaller than, or equals a value. This way we can tell the NFD master: of all the information you are getting, just create the three or four labels that I really care about; I don't care about everything else.

Automated taints. This is another important topic. Let's say I have a big cluster, but in this big cluster there is one specific part that is, say, ten super expensive GPUs, and I don't want anyone to land on those GPUs unless I specifically say so. I need to identify where the GPUs are, then taint the nodes, then create the labels. And if you start scaling that to HPC, where you are talking about thousands of nodes, that becomes a very hard manual labeling task. So, only the worthy shall pass. In NFD we have a way of creating NodeFeatureRules that create taints instead of labels. Using the same NodeFeatureRule API, we can tell the NFD master: if an NFD worker tells you that a node in the system has this and this feature, please taint that node. For those that use Kubernetes, you know that once a node is tainted, the node gets drained and only workloads with the right node selector and toleration will run on that node. This way we can automatically keep tainted nodes, or any infrastructure, free from users unless we explicitly decide otherwise.

I have a note here. Before enabling taints, we have to say this, because it's an ongoing discussion for security reasons. We tried to implement a feature for the NFD worker to automatically remove itself from the picture when the node is being drained. A lot of the security people chimed in and said that that is very insecure. So now, by default, when you use the taint feature in NFD, the NFD worker also gets removed from the node: it creates the label, it creates the taints, and the Kubernetes drain process actually takes NFD off the node itself.

Extended resources. You need to inventory what you have. I really like the scene where Bilbo's entire pantry goes out in a single night, but we have to do the same for our nodes. Using the NodeFeatureRule again, because it's a very rich API, we can define resources as well as node labels. As we can see down here, we can set allocatable and capacity. So when you define a node in Kubernetes, we can actually expose resources that are consumable. This feature was actually created for something that now, for the first time, has its own devroom: confidential computing. When you are using confidential computing in Kubernetes, or in any cluster for that matter, you are consuming, let's call them tokens: every time you run a confidential container, you have to consume one of these tokens. And once the node runs out of these tokens, no more containers can run until one of the running containers is killed. So this feature was created to expose how many tokens the node had left, and we did this with NFD. Why is this a very specific use case? Because the tokens that you consume when using confidential computing are a feature of the kernel, so they are always visible to the container.
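Putting the last two ideas together, here is a hedged sketch of a NodeFeatureRule that taints (and labels) nodes carrying a particular PCI device and, in a second rule, publishes an extended resource. The `taints` and `extendedResources` rule fields exist in recent NFD releases, but treat the exact names as assumptions; the PCI vendor ID, module name, resource name, and the constant value are placeholders.

```python
# Hedged sketch: taint nodes with a specific PCI device and, separately,
# publish an extended resource. Field names follow recent NFD releases;
# verify them against the version you deploy. Values are placeholders.
import json

rule = {
    "apiVersion": "nfd.k8s-sigs.io/v1alpha1",
    "kind": "NodeFeatureRule",
    "metadata": {"name": "restricted-accelerators"},
    "spec": {
        "rules": [
            {
                "name": "taint nodes that carry the expensive accelerator",
                "labels": {"expensive-accelerator": "true"},
                "taints": [
                    {
                        "key": "example.com/expensive-accelerator",
                        "value": "true",
                        "effect": "NoExecute",  # matches the "node gets drained" behavior
                    }
                ],
                "matchFeatures": [
                    {
                        "feature": "pci.device",
                        # Placeholder vendor ID; match whatever device you want to fence off.
                        "matchExpressions": {"vendor": {"op": "In", "value": ["10de"]}},
                    }
                ],
            },
            {
                "name": "advertise remaining confidential-computing slots",
                # Hypothetical resource name; a real rule would usually derive the
                # value from a discovered feature attribute rather than a constant.
                "extendedResources": {"example.com/cc-slots": "4"},
                "matchFeatures": [
                    {
                        "feature": "kernel.loadedmodule",
                        "matchExpressions": {"tdx_guest": {"op": "Exists"}},  # placeholder module
                    }
                ],
            },
        ]
    },
}

print(json.dumps(rule, indent=2))  # python rules.py | kubectl apply -f -
```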
So these extended resources don't replace a device plugin in Kubernetes. You cannot tell NFD: go and identify all the GPUs and create extended resources for them. NFD will create the extended resources, but your containers won't get the bind mounts and all the other things that a device plugin does in Kubernetes. So there is a very big caveat here: extended resources should only be used for things that a regular container can already see.

And with that, I wanted to keep it short. That's the whole introduction to NFD. It's been running for three or four years, it's an open source project, and everyone is welcome to contribute. Right now the main contributors are hardware vendors: Intel, AMD, NVIDIA, Mellanox, which is now part of NVIDIA as well. Every hardware vendor is contributing to it so you can expose your hardware to Kubernetes. And also non-hardware vendors: the confidential computing community has been contributing to NFD to expose all the new things that go into the kernel so you can consume confidential computing; now you can expose that to Kubernetes using NFD. And with that, Kenneth, questions? Thank you.

What about MIG technology? If I reconfigure the MIG layout of my NVIDIA GPUs, how quickly will the change be picked up?

You can actually define that in the configuration of NFD when you deploy it via Helm. By default it's every 60 seconds, but you can go down to one second if you want; that's up to you. We have been working a lot in NFD to make it light on CPU, so a full run of detecting everything on your node takes something like 0.3 milliseconds and doesn't consume more than about 10% of a single CPU thread. It's a very lightweight container given everything that it does; we try to keep it optimized. But if you want, you can take it down to one second. By default, it's every 60 seconds.

I have a feeling the previous question might have answered this, but if I wanted the GPU operator to, say, label an A100 or a V100 and then enable MIG on it afterwards, does that mean it goes NFD, then the NVIDIA stuff, then my custom stuff, then back to the NVIDIA stuff? It's not a one-shot thing. Or is there an ordering, basically?

Yeah, it goes in order. When you deploy the GPU operator that you mentioned, the GPU operator deploys the GPU Feature Discovery, which is a sidecar to NFD. What the GPU Feature Discovery does is use the API of NFD; if the API doesn't exist, the GPU Feature Discovery doesn't work. That's why you first need NFD before you deploy the GPU operator. Once you deploy the GPU Feature Discovery, it will call NFD through this API and say: hey, I detected that this node has a V100, or a Grace Hopper. But that is going to be in the hands of the GPU operator, so you don't have to worry about it; that's why the GPU operator deploys a sidecar container for this, because it's a container that is specialized in NVIDIA hardware. The same thing happens if you deploy certain Intel products: Intel will also deploy a sidecar container that is very knowledgeable about Intel infrastructure, and it will communicate with NFD for Intel features.

That's why I'm asking: can you have one feature discovery depend on the output of another, and then deploy feature discovery for a different pattern?

Oh, yeah. With NodeFeatureRules you can stack rules: if you define a rule that creates a label, you can have another rule that says, once this label exists, also do this. We actually use that for confidential containers. We first define a feature rule that creates some labels if your node supports the cryptographic confidential computing features, and once those labels exist, we deploy another feature rule that reads those labels and does other things.
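Here is a hedged sketch of what such a chain can look like inside one NodeFeatureRule object: the second rule matches on the label produced by the first through NFD's backreference mechanism. The `rule.matched` feature name and the `cpu.security` attribute are taken from the upstream documentation but should be treated as assumptions and verified against your NFD version; the label names are placeholders.

```python
# Hedged sketch of rule chaining: rule 2 matches on the output of rule 1 via
# the special rule.matched backreference feature. Treat feature and attribute
# names as assumptions; label names are placeholders.
import json

rules = {
    "apiVersion": "nfd.k8s-sigs.io/v1alpha1",
    "kind": "NodeFeatureRule",
    "metadata": {"name": "confidential-chain"},
    "spec": {
        "rules": [
            {
                "name": "detect confidential-computing support",
                "labels": {"cc-capable": "true"},
                "matchFeatures": [
                    {
                        "feature": "cpu.security",  # assumed feature/attribute names
                        "matchExpressions": {"sev.enabled": {"op": "IsTrue"}},
                    }
                ],
            },
            {
                "name": "build on the previous rule",
                "labels": {"cc-workloads-allowed": "true"},
                "matchFeatures": [
                    {
                        "feature": "rule.matched",  # backreference to earlier rule output
                        "matchExpressions": {"cc-capable": {"op": "IsTrue"}},
                    }
                ],
            },
        ]
    },
}

print(json.dumps(rules, indent=2))  # python chain.py | kubectl apply -f -
```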
Oh, hi. [Partly inaudible question about whether detecting kernel configurations on nodes is a good use case for this.]

Everything. NFD will expose every kernel feature, to the point that it will expose which kernel flags were used when booting the node. So everything is there. As I said, unless you configure your Kubernetes not to expose certain things to a container, NFD is going to read it. We try to keep it as broad as possible and leave it to the Kubernetes system administrator to restrict what does and doesn't go into the container.

The 60 seconds sounds very much like an InfiniBand subnet manager, and in Slurm you would schedule jobs per InfiniBand switch. Could you think of extending this to have Kubernetes schedule on nodes which are close to each other on the fabric?

Yes, thank you for the question; it's a two-step answer. At NVIDIA we also developed what we call the network operator, which is basically an operator that you deploy in a Kubernetes cluster, and if you are using InfiniBand or any Mellanox card, it will configure your cluster for you. It deploys NFD to help it do all of that, so the network operator leverages NFD to set up your InfiniBand and all your network configuration. Now, the topology of the nodes, that's something we have been working on, and we hope to have it this year in NFD, because we first need to run something like an MPI job. We already have a proof of concept, and that proof of concept runs an MPI job that helps us identify the topology of the cluster so we can create labels for which node is close to which node, and you can then use that information. If you reach out to me after this: we published a paper with Lawrence Livermore about two years ago where we did a proof of concept of this, and it works. We're still discussing how to implement it, but we can map the entire topology of the cluster using NFD.

Yeah, thanks for the talk. Do you have misuses of NFD that haunt you at night? Did you see someone misusing it in a way that you didn't expect, or do you have misuse cases that you came up with yourself, maybe?

Yeah. So the project was actually started by just Red Hat and Intel. It has grown bigger, and we keep getting users from many, many projects, so it's not only for the likes of Intel and NVIDIA. We have banks using it; we have people that use their clusters for things like databases; and recently on the Slack channel we got people that are using it even for testing. You have a big cluster and you want to use taints and annotations to create mini clusters inside your cluster; you can use NFD for that. So it's not just for specialized hardware; more and more we have seen people using it even for software features. Oh, and NFD is currently deployed in thousands of Kubernetes clusters today. It's a very widely used Kubernetes project, and being a maintainer of it is a big responsibility.
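Tying the earlier label and taint sketches together, here is a minimal sketch of how a workload would opt in to one of those carved-out sets of nodes: a plain pod spec with a nodeSelector on an NFD-published label and a toleration for the taint. The label, taint key, and image are placeholders carried over from the earlier sketches.

```python
# Minimal sketch: a pod that targets the tainted "expensive accelerator" nodes
# by combining an NFD label selector with a matching toleration.
# Label, taint key, and image are placeholders.
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "accelerator-job"},
    "spec": {
        "nodeSelector": {
            # Label published by NFD (bare names get this prefix by default).
            "feature.node.kubernetes.io/expensive-accelerator": "true",
        },
        "tolerations": [
            {
                "key": "example.com/expensive-accelerator",
                "operator": "Equal",
                "value": "true",
                "effect": "NoExecute",
            }
        ],
        "containers": [
            {
                "name": "job",
                "image": "registry.example.com/my-hpc-job:latest",  # placeholder image
            }
        ],
    },
}

print(json.dumps(pod, indent=2))  # python pod.py | kubectl apply -f -
```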
I would like to ask if the labels are standardized. So I'm wondering: if I have this deployment in two different clusters, can I move my application from one to the other transparently, or do I have to change everything on the deployment side?

For pretty much three years now, the labels haven't changed. The list is very long, but you can rely on every label being feature.node.kubernetes.io/ followed by the feature, like the cpu or kernel ones; that list is pretty much endless. The other thing we are trying to keep very stable is the sidecar APIs. Going back to the GPU operator example: when NVIDIA communicates with NFD to create the labels, the label reads nvidia.com/ something, and when Intel creates a label, it's intel.feature.io/ something. We try to keep the same labels so we can guarantee users that they can move from one cluster to another: just redeploy NFD and everything will work again.

OK, thanks.

Yeah, we try not to touch the labels. In the last release of NFD we added a feature where you can define your own prefix, but that's at your own risk. Any more questions? No? OK, thank the speaker.

All right, time to go. Yeah, I'm just setting up, taking over another person's talk. So I thought I would open this talk by putting two words on the slide that are either going to make you very angry or very anxious. Those words are cloud and HPC. So probably the question on everyone's mind is: what does the future look like? I'm going to answer this question by posing a question back to you: where is the money going? OK, I think I can still... oh, Zoom died. It died there for a second. OK, we'll do what we can. I'm surprised we don't have a full room. Who's going to get left behind? We can look at a paper from Reed, Gannon, and Dongarra from 2023 that identified some really interesting trends. OK. Now we have Kevin, who's going to tell us more.