[00:00.000 --> 00:29.920] Okay, the next talk is by Pablo, [00:29.920 --> 00:36.920] who is going to explain to us how to set up Slurm client environments more easily. [00:59.920 --> 01:06.920] My name is Pablo and I have been running HPC clusters for about nine years. [01:06.920 --> 01:14.920] I was running the HPC clusters at CERN and got involved mostly in Slurm, running Slurm. [01:14.920 --> 01:23.920] That's when I came up with the idea for this tool. Since about 8 or 9 months ago I have been running [01:23.920 --> 01:29.920] the HPC clusters at EPFL, and I'm also participating in the SKA project, hence the pretty background, [01:29.920 --> 01:37.920] where we also do things related to the HPC infrastructure. [01:37.920 --> 01:44.920] So just a brief introduction to Slurm in case anybody is not familiar with it. [01:44.920 --> 01:52.920] Slurm is basically both a resource manager and a job scheduler, meaning Slurm will manage [01:52.920 --> 02:00.920] the allocations: it will track which machines are assigned to which jobs, and which users own [02:00.920 --> 02:04.920] which CPUs and which nodes, etc. [02:04.920 --> 02:10.920] And it's also the job scheduler, meaning that when users submit jobs, you have your [02:10.920 --> 02:14.920] happy users over there, or hopefully they will be happy users. [02:14.920 --> 02:19.920] And they can log on to your cluster, and they make a job submission, usually writing [02:19.920 --> 02:25.920] a script that launches some workloads. [02:25.920 --> 02:30.920] And they will basically interact with Slurm, and Slurm will manage all of these job submissions. [02:30.920 --> 02:34.920] You won't just have them one by one, you will have hundreds or even thousands of jobs that are [02:34.920 --> 02:40.920] scheduled to run on your infrastructure, and Slurm will manage the queues and the priorities [02:40.920 --> 02:43.920] and the accounting, etc. [02:43.920 --> 02:50.920] So basically it's a batch manager, but it does both the resource managing and the scheduling [02:50.920 --> 02:53.920] of the jobs. [02:53.920 --> 02:58.920] Digging a bit deeper into how Slurm works, because this is relevant for this talk, there are [02:58.920 --> 03:04.920] basically two main components, two daemons, that are the most relevant, and those are the [03:04.920 --> 03:14.920] controller, which is called the slurmctld, and then the daemons that run on the worker nodes [03:14.920 --> 03:16.920] at the bottom, which is the slurmd. [03:16.920 --> 03:20.920] And then you have other daemons like the slurmdbd and the slurmrestd. [03:20.920 --> 03:26.920] Those are not relevant for this talk, I will mostly focus on the part on the left here. [03:26.920 --> 03:34.920] So users and client tools basically interact with the controller over the Slurm protocol. [03:34.920 --> 03:39.920] Nowadays there's the slurmrestd, so you can also interact over REST with some scripts, [03:39.920 --> 03:46.920] but almost all the user-level tools, almost everything in the Slurm ecosystem, just [03:46.920 --> 03:52.920] talks to the slurmctld, and this controller holds the source of truth for Slurm, so it [03:52.920 --> 03:57.920] knows which resources are allocated where, it knows which jobs exist, and it knows who the [03:57.920 --> 04:00.920] users are, etc.
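To make this concrete, these are the kinds of everyday client commands that all go through the slurmctld. A sketch; it assumes the machine already has a working slurm.conf and munge setup, and job.sh is an illustrative script name:

    # all of these talk to the slurmctld over the Slurm protocol
    scontrol ping     # check that the controller is up and responding
    sinfo             # ask the controller about partitions and node states
    squeue -u $USER   # ask the controller about your jobs
    sbatch job.sh     # submit a batch script to the controller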
[04:00.920 --> 04:05.920] The controller talks to the slurmd daemons on the nodes, and the slurmd daemons are [04:05.920 --> 04:09.920] in charge of launching the jobs, doing the cleanup, setting up the cgroups for [04:09.920 --> 04:12.920] the jobs, whatever you have. [04:12.920 --> 04:17.920] Now, what's important here is to know that for all of this to work, you need at least [04:17.920 --> 04:22.920] two things. You need the Slurm config files, and they need to be in sync across [04:22.920 --> 04:34.920] the whole cluster, so you may have some differences, but mostly they should be the same. [04:34.920 --> 04:38.920] There was no audio online? Okay. [04:38.920 --> 04:44.920] So as I was saying, the slurmctld holds the source of truth. [04:44.920 --> 04:50.920] The slurmd daemons are in charge of launching the jobs, and the two important things are [04:50.920 --> 04:54.920] that you need the Slurm configuration files. [04:54.920 --> 04:57.920] It's mostly the slurm.conf file, but there are other files as well. [04:57.920 --> 05:03.920] Those need to be in sync across the whole cluster, and they need to be basically the same. [05:03.920 --> 05:05.920] They should have the same hash, ideally. [05:05.920 --> 05:11.920] And then you should also have a shared secret, so that a rogue client cannot just [05:11.920 --> 05:15.920] add a worker node to the cluster and start doing malicious things. [05:15.920 --> 05:18.920] So usually it's a munge secret. [05:18.920 --> 05:24.920] There's a daemon called munge, and you have a shared secret as well for the whole cluster. [05:24.920 --> 05:29.920] And this fact is very relevant for this talk. [05:29.920 --> 05:32.920] Now, on to containers. [05:32.920 --> 05:37.920] So containers are increasingly becoming a super popular tool to run infrastructure, for [05:37.920 --> 05:41.920] reproducibility, for automating deployments. [05:41.920 --> 05:48.920] And just in general, they're becoming super ubiquitous in our industry. [05:48.920 --> 05:52.920] I think for good reasons. [05:52.920 --> 05:59.920] And there are, I think, very good use cases for using containers with Slurm. [05:59.920 --> 06:06.920] In this talk, I will focus on the use case where you use containers on the user and client [06:06.920 --> 06:07.920] side of things. [06:07.920 --> 06:13.920] So those tools that will talk to Slurm, to the controller mostly, to do things on the cluster. [06:13.920 --> 06:17.920] So this could be some automation that you run to do whatever. [06:17.920 --> 06:21.920] For instance, you could use it for monitoring purposes. [06:21.920 --> 06:27.920] You could write a tool that does health checks on the cluster, or for accounting. [06:27.920 --> 06:30.920] I've used it extensively for accounting as well. [06:30.920 --> 06:33.920] But also integration with other services, right? [06:33.920 --> 06:39.920] If you want to connect a Jupyter notebook with Slurm, you will end up with some tools [06:39.920 --> 06:45.920] that talk to the controller. [06:45.920 --> 06:55.920] Now, there are basically two scenarios in which you can use containers with Slurm. [06:55.920 --> 06:57.920] On the left, we have the local use case. [06:57.920 --> 07:03.920] That means, imagine you have a frontend node, a machine that's configured for your users to SSH to. [07:03.920 --> 07:10.920] And from there, they can run the Slurm commands to launch jobs, to track their job usage, et cetera.
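For the local case described next, the bind-mount approach boils down to something like the following. A sketch: the image name is made up, the munge socket is typically at /run/munge/munge.socket.2 but may vary, and user/UID mapping between host and container is glossed over here:

    # run a Slurm client container on the frontend node itself,
    # reusing the host's config files and munge socket via bind mounts
    docker run --rm -it \
      -v /etc/slurm:/etc/slurm:ro \
      -v /run/munge/munge.socket.2:/run/munge/munge.socket.2 \
      my-slurm-client-image sinfo   # hypothetical image name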
[07:10.920 --> 07:13.920] It's conventionally called a frontend node for the cluster. [07:13.920 --> 07:19.920] So if you just add the Slurm client container on that node, it's very simple. [07:19.920 --> 07:26.920] Because, as I said, you need a secret with munge, and you need the config files. [07:26.920 --> 07:29.920] And that scenario is very simple because you can just do bind mounts, [07:29.920 --> 07:32.920] and you can access the munge socket to talk to Slurm. [07:32.920 --> 07:38.920] And you bind mount the Slurm config directory, and you're done, basically. [07:38.920 --> 07:40.920] So that's sort of easy. [07:40.920 --> 07:47.920] However, for the use case on the right, you have the distributed or remote use case. [07:47.920 --> 07:55.920] In that case, you may run your Slurm client container in a different service. [07:55.920 --> 08:00.920] That's in a different network, or you may run it on Kubernetes or somewhere else. [08:00.920 --> 08:07.920] In that case, you obviously can't just do the bind mounts, because you need to give it all those things. [08:07.920 --> 08:13.920] So you would have to give it all the Slurm config files and somehow the munge shared key, [08:13.920 --> 08:21.920] so that your external service can talk to your cluster, right, specifically to the Slurm controller. [08:21.920 --> 08:25.920] Now, this is an extract from a Dockerfile. [08:25.920 --> 08:26.920] This is the naive approach. [08:26.920 --> 08:29.920] This is how I started trying things. [08:29.920 --> 08:30.920] Easy, right? [08:30.920 --> 08:36.920] You just take the Slurm config, and you just copy it to the destination, right? [08:36.920 --> 08:38.920] And this will absolutely work. [08:38.920 --> 08:45.920] But I was not happy with this approach, because then you end up managing two copies of your Slurm config. [08:45.920 --> 08:52.920] When you do configuration management and automation of your infrastructure, [08:52.920 --> 08:54.920] I really like having a single source of truth. [08:54.920 --> 09:02.920] And managing it in this way with containers is very fiddly, because it's very easy that you will forget to update it, [09:02.920 --> 09:04.920] or something will fail to update it automatically. [09:04.920 --> 09:06.920] It's just not ideal. [09:06.920 --> 09:08.920] I didn't like this approach, but it will work. [09:08.920 --> 09:11.920] It will work. [09:11.920 --> 09:17.920] And some of you who know Slurm may say, oh, but Pablo, why wouldn't you just use Slurm's configless feature? [09:17.920 --> 09:28.920] So Slurm configless is a new feature since Slurm 20 or so that basically allows a client to just pull the config files from Slurm. [09:28.920 --> 09:36.920] So the slurmd daemons that run on the worker nodes, when they start, will just grab the Slurm config files. [09:36.920 --> 09:42.920] So you can just remove the need to even copy the Slurm config, right? [09:42.920 --> 09:46.920] Well, that's a trick question. [09:46.920 --> 09:53.920] Not necessarily, because then you need to run a slurmd daemon in your container. [09:53.920 --> 09:56.920] And you also need the munge daemon. [09:56.920 --> 10:00.920] And it sounds easy, but it's really not. [10:00.920 --> 10:08.920] You will need to do a lot of hacks. This is an extract from a container that I was creating. [10:08.920 --> 10:11.920] And you run into lots of awful things.
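To give a flavor of those hacks, the next two stories boil down to something like this. This is a reconstruction of the kind of thing needed, not the actual lines from his container; paths assume cgroup v1 and a Kubernetes-mounted secret:

    # remount the cgroup hierarchy so the release_agent file that
    # slurmd expects actually shows up inside the container
    mount -o remount /sys/fs/cgroup/freezer
    # munged refuses to read a key that is a symlink (which is how
    # Kubernetes mounts secrets), so copy it to a real file first
    cp /run/secrets/munge.key /etc/munge/munge.key   # source path illustrative
    chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
    munged                                    # the two extra daemons
    slurmd --conf-server ctl.example.org      # configless pull; host illustrative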
[10:11.920 --> 10:21.920] For instance, the slurmd daemon expects this release_agent file to exist in the cgroup, and the containers just don't create it. [10:21.920 --> 10:26.920] I tried it on Docker. I tried it on different Kubernetes versions. It just doesn't exist. [10:26.920 --> 10:31.920] I don't know why. I couldn't find out why. If anybody knows, please tell me. [10:31.920 --> 10:36.920] I googled around and found that it could be related to some privilege escalation issues. [10:36.920 --> 10:39.920] However, if you just remount the cgroups, the file appears. [10:39.920 --> 10:42.920] So I'm not sure what's going on there. [10:42.920 --> 10:47.920] Another fun story is that, for instance, if you're using Kubernetes, [10:47.920 --> 10:57.920] Kubernetes likes to give you a symlink to your secrets, and munge refuses to take the secret from a symlink, for security reasons. [10:57.920 --> 10:59.920] It makes sense. So, no luck there. [10:59.920 --> 11:01.920] So you will need to put in hacks. [11:01.920 --> 11:05.920] And it's hacks on top of hacks on top of hacks, just to run these two daemons. [11:05.920 --> 11:10.920] And yeah, I was not very happy with this approach either. [11:10.920 --> 11:15.920] So basically I was faced with two options. [11:15.920 --> 11:17.920] We arrive at this situation, and you're faced with two options. [11:17.920 --> 11:23.920] Either you do the first, naive approach, where you just copy all the stuff into your Slurm container. [11:23.920 --> 11:26.920] You manage a copy of your Slurm config files. [11:26.920 --> 11:33.920] But as I said, if you want a single source of truth, this might not be ideal. [11:33.920 --> 11:37.920] And of course, in this use case as well you need munge, [11:37.920 --> 11:39.920] and you need to supply the munge key. [11:39.920 --> 11:43.920] Or you can try the configless approach, but then you need to add slurmd to your container [11:43.920 --> 11:46.920] so it can pull your config files via configless. [11:46.920 --> 11:48.920] But then anyway, you also need munge. [11:48.920 --> 11:52.920] And you need to add the munge key to your container somehow and manage secrets. [11:52.920 --> 11:56.920] I mean, if you're running Kubernetes or some other container manager, it might not be a big issue. [11:56.920 --> 12:02.920] But you will still need to maintain all these extra daemons with nasty hacks. [12:02.920 --> 12:08.920] And we don't like having lots of hacks in our infrastructure. [12:08.920 --> 12:13.920] There's a third option, by the way, which is trying to go secretless, [12:13.920 --> 12:18.920] where you try to use JSON web tokens. But it doesn't work in combination with configless. [12:18.920 --> 12:21.920] It gives a lot of issues. It doesn't really work. I tried it. [12:21.920 --> 12:23.920] So I didn't include it here. [12:23.920 --> 12:28.920] Just mentioning it in case somebody thought about it. [12:28.920 --> 12:31.920] So, Pablo, you talked about the bad and the ugly. What about the good? [12:31.920 --> 12:33.920] Is there any good part to this? [12:33.920 --> 12:34.920] I'm glad you asked. [12:34.920 --> 12:39.920] Yes. What if we had a single-shot CLI tool, [12:39.920 --> 12:44.920] a very simple tool that was just able to authenticate to the controller, [12:44.920 --> 12:48.920] either using munge or JSON web tokens, which Slurm also supports, [12:48.920 --> 12:52.920] and just fetch the config files, and then it's done?
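In pseudo-shell, the whole lifecycle of such a single-shot tool would be roughly the following. The command name and flags are made up purely for illustration:

    # one shot: authenticate, fetch the config files, exit --
    # no slurmd and no munged left running in the container
    export SLURM_JWT=eyJ...                  # token from your secret store
    fetch-slurm-config --server ctl.example.org --dest /etc/slurm  # hypothetical tool
    sinfo   # from here on, the ordinary Slurm client tools just work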
[12:52.920 --> 12:55.920] That's all you really want to do, right? [12:55.920 --> 12:59.920] Because then your tools, the Slurm tools, can work, [12:59.920 --> 13:04.920] because they have the Slurm config files, and just by having the JSON web token in your environment, [13:04.920 --> 13:08.920] you can just talk to the Slurm controller. [13:08.920 --> 13:11.920] And yeah, that's the tool that I wrote. [13:11.920 --> 13:15.920] It's a very simple tool. It just does exactly what I described there. [13:15.920 --> 13:19.920] And it's open source. You can find it on GitHub. [13:19.920 --> 13:22.920] I uploaded it in the past month. [13:22.920 --> 13:24.920] Fun story about this. [13:24.920 --> 13:27.920] As I said, I had the idea for this when I was back at CERN. [13:27.920 --> 13:31.920] I worked on this a year ago already. [13:31.920 --> 13:34.920] But then I somehow lost the source. [13:34.920 --> 13:36.920] I don't know what happened. [13:36.920 --> 13:39.920] Just before I left CERN, the source was just lost. [13:39.920 --> 13:42.920] I don't know why. I must have deleted it by accident. [13:42.920 --> 13:44.920] I don't know what happened. [13:44.920 --> 13:47.920] So after I left CERN, I kept in contact with my ex-colleagues, [13:47.920 --> 13:51.920] and they were telling me that they wanted to do this integration with SWAN, [13:51.920 --> 13:55.920] which is the, who here knows SWAN? Anybody? [13:55.920 --> 13:57.920] Okay, one, two, three. [13:57.920 --> 14:00.920] Yeah, so it's the Jupyter notebook service at CERN, [14:00.920 --> 14:03.920] which also does analytics. [14:03.920 --> 14:05.920] And we wanted to connect it to Slurm, [14:05.920 --> 14:09.920] and we ran into all these issues, because this is a service that's exposed to the whole internet. [14:09.920 --> 14:14.920] So we didn't want to have the munge key for the Slurm cluster in the container, et cetera. [14:14.920 --> 14:19.920] Anyway, so then I left CERN, and then, yeah, my colleagues were telling me, [14:19.920 --> 14:22.920] oh, it would have been so useful to have this, what a pity. [14:22.920 --> 14:29.920] And then a few months ago, I just didn't like the fact that I had lost the source and all those days. [14:29.920 --> 14:34.920] I had spent a couple of days reverse-engineering the Slurm protocol, [14:34.920 --> 14:42.920] and I just didn't like losing that, so I rewrote it more properly in Python and just made it public. [14:42.920 --> 14:47.920] So if you're interested in making client containers like this, [14:47.920 --> 14:51.920] feel free to give it a try. [14:51.920 --> 14:53.920] It looks a bit like this. [14:53.920 --> 14:55.920] It's very simple. [14:55.920 --> 15:01.920] You can choose between munge or JWT, JSON web token, authentication. [15:01.920 --> 15:04.920] If you choose JWT, which is the simplest one, [15:04.920 --> 15:08.920] you just need an environment variable with a token, [15:08.920 --> 15:12.920] and you can tell it where you want to store the config files, [15:12.920 --> 15:16.920] and then you have verbosity as an option. [15:16.920 --> 15:18.920] So it's very simple. [15:18.920 --> 15:24.920] It has very few dependencies. [15:24.920 --> 15:31.920] The tool talks several Slurm protocol versions, [15:31.920 --> 15:37.920] because with every major release, Slurm changes the protocol version. [15:37.920 --> 15:40.920] So you can list them with -l, [15:40.920 --> 15:46.920] and it will show you basically all the versions that it supports.
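For reference, on a cluster where JWT authentication is enabled (AuthAltTypes=auth/jwt in slurm.conf), such a token can be generated with scontrol. A sketch; the username and lifespan are illustrative:

    # on the cluster: mint a token for a user (lifespan in seconds)
    scontrol token username=myuser lifespan=3600
    # prints something like: SLURM_JWT=eyJhbGciOiJIUzI1NiIsInR5cCI6...
    # in the client environment: export it so the client tools can use it
    export SLURM_JWT=eyJhbGciOiJIUzI1NiIsInR5cCI6...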
[15:46.920 --> 15:50.920] So imagine you have a Slurm JSON web token in this variable. [15:50.920 --> 15:55.920] You can just tell it to do JSON web token authentication with the server. [15:55.920 --> 16:00.920] It supports multiple controllers in case you have high availability set up in your Slurm cluster, [16:00.920 --> 16:04.920] so you can specify a list of servers that it will retry until it succeeds, [16:04.920 --> 16:07.920] and then you tell it the protocol version of the slurmctld, [16:07.920 --> 16:11.920] because it needs to know which protocol it should talk. [16:11.920 --> 16:15.920] Protocol version negotiation, I think, doesn't exist in the Slurm protocol, [16:15.920 --> 16:18.920] so you have to tell it which version you want it to talk, [16:18.920 --> 16:22.920] and that's it, and then it will just download the Slurm config files, [16:22.920 --> 16:27.920] and happy days for your containers. [16:27.920 --> 16:32.920] Conclusions. I think I'm ahead of time. [16:32.920 --> 16:38.920] So this tool, called straw, can reduce the cost of creating and maintaining [16:38.920 --> 16:40.920] your Slurm client containers. [16:40.920 --> 16:42.920] It can also increase security, [16:42.920 --> 16:45.920] because you don't need to put the munge key everywhere [16:45.920 --> 16:48.920] you're running your client containers; [16:48.920 --> 16:51.920] JSON web tokens reduce the attack surface. [16:51.920 --> 16:54.920] Caveats, caveats. [16:54.920 --> 16:58.920] I think this tool should not exist, [16:58.920 --> 17:01.920] because ideally this would be supported upstream. [17:01.920 --> 17:07.920] So, you know, if anybody has any influence on SchedMD's Slurm development, [17:07.920 --> 17:13.920] yeah, I think it would be nice if we had this built into Slurm. [17:13.920 --> 17:18.920] And then the second caveat is that the JSON web token, [17:18.920 --> 17:24.920] the token, needs to be associated with the slurm user, basically. [17:24.920 --> 17:30.920] So ideally, you would be able to just generate a JSON web token for a user [17:30.920 --> 17:33.920] that's going to run on the Slurm cluster, [17:33.920 --> 17:37.920] and then if the secret for some reason is exposed, you've only exposed [17:37.920 --> 17:40.920] the JSON web token of a single user. [17:40.920 --> 17:46.920] However, this is a limitation built into Slurm, basically. [17:46.920 --> 17:49.920] You cannot pull the Slurm config file over the protocol [17:49.920 --> 17:54.920] unless the token belongs to the slurm user, or to root. [17:54.920 --> 17:59.920] Still, I think it's an improvement over having your munge key available everywhere. [17:59.920 --> 18:03.920] So feel free to try it out. That was it. [18:03.920 --> 18:07.920] I'm happy to answer any questions you might have. [18:07.920 --> 18:10.920] APPLAUSE [18:13.920 --> 18:16.920] Thank you very much, Pablo. [18:16.920 --> 18:19.920] Time for questions. [18:19.920 --> 18:23.920] So what kind of clients do need the config file? [18:23.920 --> 18:25.920] Could you do everything over REST nowadays? [18:25.920 --> 18:29.920] Is it still necessary to use the config file? [18:29.920 --> 18:34.920] Yes, so anything that wants to run srun, sbatch, squeue, sinfo. [18:34.920 --> 18:37.920] For instance, if you have the Jupyter notebook plugins, [18:37.920 --> 18:39.920] they will just run those commands.
[18:39.920 --> 18:43.920] Or if you want to run a client that uses PySlurm, for instance, [18:43.920 --> 18:47.920] or any library really, anything that uses libslurm underneath [18:47.920 --> 18:50.920] will automatically read the config files, right? [18:50.920 --> 18:55.920] So, of course, you can write your own client, [18:55.920 --> 19:01.920] handwritten from scratch, that just interacts with the Slurm REST API to do stuff. [19:01.920 --> 19:07.920] Yes, but then you cannot leverage all the existing user client tools, [19:07.920 --> 19:10.920] and libslurm, PySlurm, etc. [19:10.920 --> 19:16.920] So if you want to create a Python tool, for instance, that leverages PySlurm, [19:16.920 --> 19:20.920] this would be, I think, a good solution. [19:20.920 --> 19:25.920] I think Slurm does have, like, a REST API, but it's considered very insecure. [19:25.920 --> 19:29.920] So even the documentation tells you, like, don't use this. [19:29.920 --> 19:33.920] I just haven't understood, like, for a long time now, [19:33.920 --> 19:36.920] why everyone needs the config file, right? [19:36.920 --> 19:38.920] I mean, why does it need to be in sync? [19:38.920 --> 19:41.920] Like, couldn't they just exchange the information over the protocol [19:41.920 --> 19:43.920] and just say, like, this is your Slurm server? [19:43.920 --> 19:46.920] Yeah, that's the configless feature. That's the configless feature, essentially. [19:46.920 --> 19:48.920] Yeah, but the configless feature just downloads the config. [19:48.920 --> 19:49.920] Yes. [19:49.920 --> 19:50.920] So it's not, like, configless, okay? [19:50.920 --> 19:51.920] Yes. [19:51.920 --> 19:54.920] I download the config. I just don't need the config beforehand. [19:54.920 --> 19:57.920] It's like serverless. There's always a server somewhere. [19:57.920 --> 20:00.920] Yes. Yeah, exactly. [20:00.920 --> 20:03.920] So that's just how Slurm works. [20:03.920 --> 20:04.920] Yeah. [20:04.920 --> 20:14.920] So I'm still a little confused about the Slurm client container. [20:14.920 --> 20:17.920] So the container is an application on the actual Slurm client, [20:17.920 --> 20:19.920] because you have to document in the slurm.conf, [20:19.920 --> 20:21.920] you have to sort of say what your clients are, [20:21.920 --> 20:26.920] so that the scheduler can intelligently decide how to schedule jobs, right? [20:26.920 --> 20:27.920] Am I missing something? [20:27.920 --> 20:31.920] No, you don't really need to declare all the clients for Slurm. [20:31.920 --> 20:36.920] You just need to declare the worker nodes that are part of it. [20:36.920 --> 20:38.920] But you can have any... [20:38.920 --> 20:40.920] I mean, it depends on how you've configured it. [20:40.920 --> 20:44.920] You can limit it. You can limit in Slurm which clients are allowed to connect, [20:44.920 --> 20:45.920] but you don't have to. [20:45.920 --> 20:47.920] So you could just... [20:47.920 --> 20:49.920] But even if you do, you will need this, [20:49.920 --> 20:51.920] because you will... [20:51.920 --> 20:54.920] Even if you authorize a host name to connect as a client, [20:54.920 --> 20:59.920] it will need to have the munge key and the slurm.conf files, et cetera. [20:59.920 --> 21:01.920] Does this answer your question? [21:01.920 --> 21:03.920] Well, no, so when you...
[21:03.920 --> 21:06.920] In the slurm.conf, you sort of detail what your partitions are, [21:06.920 --> 21:09.920] and you have to kind of tell it what the capabilities are of your clients, [21:09.920 --> 21:10.920] of your Slurm clients, right? [21:10.920 --> 21:12.920] So that Slurm can decide how to schedule jobs. [21:12.920 --> 21:14.920] I'm missing something. [21:14.920 --> 21:16.920] Well, I think you're thinking about the compute nodes. [21:16.920 --> 21:17.920] Yeah, I am. [21:17.920 --> 21:20.920] Yeah, the NodeName part of the slurm.conf. [21:20.920 --> 21:22.920] So the containers run on the compute nodes? [21:22.920 --> 21:24.920] No, the containers would be... [21:24.920 --> 21:27.920] Let me go back to one of the slides where... [21:27.920 --> 21:32.920] So you're thinking maybe about the compute nodes, [21:32.920 --> 21:34.920] each of which runs a slurmd daemon, [21:34.920 --> 21:36.920] and those you have to declare. [21:36.920 --> 21:38.920] Yes. I think in 2023, by the way, [21:38.920 --> 21:41.920] you will be able to dynamically spawn compute nodes, [21:41.920 --> 21:45.920] but that's the future. [21:45.920 --> 21:48.920] What I'm talking about is all the users and client tools [21:48.920 --> 21:51.920] that connect to the controller to run squeue, sinfo, [21:51.920 --> 21:53.920] like when you use Slurm and you... [21:53.920 --> 21:54.920] Hello. [21:54.920 --> 21:58.920] So if you had some tooling that you automated [21:58.920 --> 22:02.920] to gather metrics from Slurm or, yeah, [22:02.920 --> 22:04.920] a Jupyter notebook service, for instance, [22:04.920 --> 22:07.920] that connects to your cluster, that wants to launch jobs, [22:07.920 --> 22:10.920] that wants to run sbatch, squeue, whatever, [22:10.920 --> 22:12.920] that's in that domain. [22:12.920 --> 22:15.920] Yeah, I mean, the newest Warewulf runs containers, [22:15.920 --> 22:18.920] that's my background for this question. [22:18.920 --> 22:20.920] I mean, I think the newest version of Warewulf [22:20.920 --> 22:23.920] is set up to run containers on the Slurm clients, right? [22:23.920 --> 22:26.920] It's sort of, you're actually launching containers [22:26.920 --> 22:28.920] as applications, so that was kind of... [22:28.920 --> 22:30.920] That's on the compute nodes. [22:30.920 --> 22:31.920] On the compute nodes, yeah. [22:31.920 --> 22:33.920] Yeah, yeah, that's the compute nodes. [22:36.920 --> 22:38.920] Thank you for your talk. [22:38.920 --> 22:40.920] So I have a question. [22:40.920 --> 22:43.920] You are telling us that you can pull the configuration [22:43.920 --> 22:47.920] with your tool, but there are many files... [22:47.920 --> 22:50.920] Fine, you can't pull them with configless. [22:50.920 --> 22:53.920] For example, all the SPANK plugins, [22:53.920 --> 22:55.920] or, I think topology.conf you can pull, [22:55.920 --> 22:58.920] but various files, like I said, the SPANK plugins and so on. [22:58.920 --> 23:01.920] So how do you manage the kinds of config files [23:01.920 --> 23:04.920] that are not handled by default by Slurm? [23:04.920 --> 23:06.920] Right, that's correct. [23:06.920 --> 23:08.920] So when you use the configless feature, [23:08.920 --> 23:10.920] it will download the, you know, the slurm.conf, [23:10.920 --> 23:12.920] the cgroup.conf, a lot of config files, [23:12.920 --> 23:16.920] but it will not download your plugins, your plugin files.
[23:16.920 --> 23:18.920] But I think those are usually not needed [23:18.920 --> 23:20.920] if you're running a client, [23:20.920 --> 23:22.920] because those are usually just needed [23:22.920 --> 23:24.920] by the slurmd daemons, right? [23:24.920 --> 23:26.920] I mean, by the worker nodes. [23:26.920 --> 23:28.920] Like the epilog, the prolog, [23:28.920 --> 23:30.920] you mean all of those plugin scripts, right? [23:30.920 --> 23:32.920] The authentication plugins. [23:32.920 --> 23:34.920] Those are usually needed by the slurmd daemon, [23:34.920 --> 23:36.920] but if you're just writing a client, [23:36.920 --> 23:38.920] say you're automating something [23:38.920 --> 23:40.920] with PySlurm to interact with it, [23:40.920 --> 23:42.920] you don't need those files. [23:42.920 --> 23:44.920] And Slurm will happily... [23:44.920 --> 23:46.920] You can happily run [23:46.920 --> 23:48.920] all of those commands without those files. [23:48.920 --> 23:51.920] Yeah, okay, so if I just summarize, [23:51.920 --> 23:53.920] the idea is just to create some frontend nodes, [23:53.920 --> 23:56.920] but not really worker nodes. [23:56.920 --> 23:58.920] Is that right? [23:58.920 --> 24:00.920] So you... [24:00.920 --> 24:05.920] So if you want to use configless to set up a frontend node, [24:05.920 --> 24:08.920] you might need those files from somewhere else. [24:08.920 --> 24:11.920] But if you're just creating a container [24:11.920 --> 24:13.920] to just interact with Slurm and send Slurm commands, [24:13.920 --> 24:16.920] you don't need them, basically. [24:16.920 --> 24:19.920] Because the plugin files are usually the... [24:19.920 --> 24:22.920] Yeah, the epilog and prolog for the slurmd [24:22.920 --> 24:24.920] or the slurmctld. [24:24.920 --> 24:31.920] And that's not what these Slurm client containers are about. [24:31.920 --> 24:34.920] So short answer, you usually don't need them. [24:34.920 --> 24:36.920] Hello, thank you for the talk. [24:36.920 --> 24:39.920] I'm wondering, in huge institutions, [24:39.920 --> 24:41.920] like at CERN or EPFL, [24:41.920 --> 24:54.920] would you run your own forked or patched Slurm [24:54.920 --> 24:59.920] so you could fix maybe the authentication privileges? [24:59.920 --> 25:02.920] Or is it just not done because it's... [25:02.920 --> 25:05.920] I've never carried any Slurm patches, to be honest. [25:05.920 --> 25:08.920] Both at CERN and at EPFL, [25:08.920 --> 25:10.920] we just use Slurm out of the box. [25:10.920 --> 25:13.920] It works well enough for our use cases. [25:13.920 --> 25:16.920] It is true that you could, for instance, do a patch [25:16.920 --> 25:21.920] to enable finer granularity for the permissions. [25:21.920 --> 25:24.920] For instance, you could enable any user to pull the config file. [25:24.920 --> 25:26.920] That would be a nice patch. [25:26.920 --> 25:28.920] We don't do it. [25:28.920 --> 25:31.920] Okay, thank you. [25:31.920 --> 25:34.920] We have time for one short question. [25:34.920 --> 25:35.920] Hi, thanks. [25:35.920 --> 25:37.920] We actually are very interested in this, [25:37.920 --> 25:41.920] because we are applying... [25:41.920 --> 25:43.920] We have a JupyterHub frontend [25:43.920 --> 25:47.920] that actually talks to a Slurm cluster through SSH, [25:47.920 --> 25:49.920] because we don't want to install all that stuff, [25:49.920 --> 25:52.920] like munge and the full Slurm deployment, [25:52.920 --> 25:54.920] into the JupyterHub host.
[25:54.920 --> 25:57.920] And I'm wondering, how does it actually talk to the Slurm controller? [25:57.920 --> 26:02.920] So is the slurmctld always listening to any... [26:02.920 --> 26:05.920] any of the hosts that will talk to it? [26:05.920 --> 26:06.920] Yes. [26:06.920 --> 26:08.920] Or are there any restrictions on who is connecting [26:08.920 --> 26:10.920] to the slurmctld daemon? [26:10.920 --> 26:15.920] So there's an AllocNodes setting in the slurm.conf, I believe, [26:15.920 --> 26:19.920] which will allow you to restrict from which nodes [26:19.920 --> 26:21.920] you can allocate resources. [26:21.920 --> 26:22.920] Okay. [26:22.920 --> 26:24.920] So you can limit it. [26:24.920 --> 26:26.920] However, if you don't have that, [26:26.920 --> 26:29.920] Slurm will happily accept anything, [26:29.920 --> 26:31.920] because if you have the shared secret, [26:31.920 --> 26:33.920] it's considered good enough. [26:33.920 --> 26:34.920] Okay. [26:34.920 --> 26:35.920] Or a valid JSON web token. [26:35.920 --> 26:36.920] Okay. [26:36.920 --> 26:37.920] Yeah. [26:37.920 --> 26:38.920] Thank you. [26:38.920 --> 26:39.920] Thank you very much, Pablo. [26:39.920 --> 26:40.920] Thanks. [26:40.920 --> 26:57.920] Thank you very much.
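For reference, the AllocNodes setting mentioned in that last answer is a partition option in slurm.conf. A sketch with illustrative partition and node names:

    # slurm.conf fragment: only login01/login02 may allocate jobs
    # in this partition (all names are illustrative)
    PartitionName=batch Nodes=cn[001-100] AllocNodes=login01,login02 Default=YES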