[00:00.000 --> 00:15.720] Okay, the next talk is a perfect fit after the previous talk.
[00:15.720 --> 00:21.960] So Olivier and Axel are going to talk about Troika, a system to easily manage and submit
[00:21.960 --> 00:24.680] your jobs to any HPC system.
[00:24.680 --> 00:28.520] Yeah, thanks for inviting me and thanks for letting me talk.
[00:28.520 --> 00:36.040] So yeah, Troika, as Kenneth said, is a system that lets us interact with job submission
[00:36.040 --> 00:38.640] systems through a single interface.
[00:38.640 --> 00:42.000] Just before I start, a bit of context about where I work.
[00:42.000 --> 00:48.840] I work for the European Centre for Medium-Range Weather Forecasts, which is a Europe-based
[00:48.840 --> 00:51.120] international organization.
[00:51.120 --> 00:56.400] We run an operational weather forecasting service four times a day that we send out
[00:56.400 --> 01:01.960] to national meteorological services and private customers.
[01:01.960 --> 01:10.920] We also operate quite a variety of services: we have our own in-house research to
[01:10.920 --> 01:15.480] improve the models, and we do climate analysis and reforecasts.
[01:15.480 --> 01:21.320] We operate services linked to climate change, for instance, as part of the EU Copernicus
[01:21.320 --> 01:27.080] Service, and we've just started a new project called Destination Earth.
[01:27.080 --> 01:32.120] I'll talk a bit more about that because it's a nice entry point to what I will present.
[01:32.120 --> 01:36.000] It's an EU programme for weather and climate.
[01:36.000 --> 01:41.040] It's a large collaboration that we drive with ESA, the European Space Agency, and EUMETSAT,
[01:41.040 --> 01:44.920] the European Organisation for the Exploitation of Meteorological Satellites.
[01:44.920 --> 01:48.800] The goal is to run simulations of the Earth at one-kilometer resolution.
[01:48.800 --> 01:56.440] For those who are wondering, that's about 256 million points per vertical level.
[01:56.440 --> 02:02.920] So this project is quite big, and it will run on multiple HPC systems across Europe,
[02:02.920 --> 02:14.240] for instance BSC in Barcelona and LUMI in Finland, just to name two.
[02:14.240 --> 02:19.520] And that means we will require some level of flexibility to run our workflows.
[02:19.520 --> 02:26.880] You notice I didn't say jobs, because in weather forecasting, and also for these projects,
[02:26.880 --> 02:31.160] we have lots of different tasks that we run together.
[02:31.160 --> 02:37.040] Here you can see an overview of what we run operationally.
[02:37.040 --> 02:43.960] In practice, that's a few thousand tasks that run every time we want to run one of
[02:43.960 --> 02:47.120] these pipelines.
[02:47.120 --> 02:50.520] And we have multiple types of workflows in-house.
[02:50.520 --> 02:53.800] The main one is the operational one, of course.
[02:53.800 --> 02:58.160] But then researchers have their own workflows.
[02:58.160 --> 03:03.920] We have support workflows like CI/CD, deploying software, or just fetching and analyzing
[03:03.920 --> 03:06.360] data, things like that.
[03:06.360 --> 03:12.720] That amounts to about half a million tasks per day on our HPC cluster.
[03:12.720 --> 03:19.120] Sometimes we run parallel jobs, but most of those tasks are just small, one-CPU
[03:19.120 --> 03:25.040] or few-CPU tasks, just to do some processing.
[03:25.040 --> 03:30.040] So for that, we use a workflow manager that we developed called ecFlow, which basically
[03:30.040 --> 03:34.280] manages a task graph as a tree with additional dependencies.
[03:34.280 --> 03:39.520] So you can have dependencies on dates, loops, and things like that.
[03:39.520 --> 03:42.520] And it runs a script for every task,
[03:42.520 --> 03:47.160] a task being one leaf in the tree shown here.
[03:47.160 --> 03:52.320] It stores variables for pre-processing if needed, keeps track of the task status, and fetches
[03:52.320 --> 03:54.760] log files on demand.
[03:54.760 --> 04:01.280] What it doesn't do, to keep it simple, is connect to remote systems or talk to specific queuing
[04:01.280 --> 04:02.280] systems.
[04:02.280 --> 04:10.480] So ecFlow just runs commands on the server host, which is usually a VM, and provides three
[04:10.480 --> 04:16.720] entry points for every task: submit, monitor, and kill.
[04:16.720 --> 04:21.440] And if you want to run an actual job on an HPC system, that means you have to have
[04:21.440 --> 04:22.640] some kind of interface.
[04:22.640 --> 04:27.880] You can start by just saying the command is "SSH to my cluster and
[04:27.880 --> 04:31.160] submit a job", and that's it, which works.
[04:31.160 --> 04:36.640] But when you change clusters, or even just need to add an option, you're in trouble,
[04:36.640 --> 04:42.000] because you have to change that variable, and it can be a bit painful, especially if
[04:42.000 --> 04:47.720] you have thousands of tasks or you don't want to regenerate the whole workflow.
[04:47.720 --> 04:51.520] So the next possible thing is to write a shell script.
[04:51.520 --> 04:54.120] You could do multiple actions in your script,
[04:54.120 --> 04:59.200] and you have a bit more flexibility, but I don't know if you've tried handling configuration in
[04:59.200 --> 05:00.200] a shell script.
[05:00.200 --> 05:03.520] It usually turns into a nightmare quite easily.
[05:03.520 --> 05:10.000] It's very hard to maintain, and if you deal with several people, everyone has their own.
[05:10.000 --> 05:16.320] So we tried to have something a bit cleaner, and we want to delegate this to a submission
[05:16.320 --> 05:23.800] interface that can be made generic, gives you lots of flexibility, and can also
[05:23.800 --> 05:29.960] be maintained as a proper piece of software, which means versioning, testing, and at least some level
[05:29.960 --> 05:33.800] of reproducibility.
[05:33.800 --> 05:39.640] We call our software Troika because it mainly runs those three actions: submit, monitor,
[05:39.640 --> 05:41.800] and kill.
[05:41.800 --> 05:48.440] It's able to handle connections to a remote system, mostly using SSH.
[05:48.440 --> 05:55.280] It's also able to prepare the job script for submission and interact with a queuing system,
[05:55.280 --> 05:59.000] and optionally you can run hooks at various points.
[05:59.000 --> 06:01.440] It's written in Python.
[06:01.440 --> 06:08.120] We put a strong emphasis on making it configurable, so everything can be driven by configuration.
[06:08.120 --> 06:11.080] I'll show how this works afterwards.
[06:11.080 --> 06:16.120] And we want it to be extensible, so you can add new connection methods if running locally
[06:16.120 --> 06:20.440] on your server node or running over SSH isn't enough.
[06:20.440 --> 06:27.360] If you want to support another queuing system, same thing: you can just add a plug-in.
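To tie this back to the three ecFlow entry points mentioned above, here is a rough sketch of how an ecFlow suite definition could delegate them to Troika. The ECF_JOB_CMD, ECF_KILL_CMD and ECF_STATUS_CMD variables and the %ECF_JOB% and %ECF_JOBOUT% substitutions are standard ecFlow; the troika flags follow the command-line example shown a bit later in the talk, and %USER% and %HOST% are hypothetical suite variables, so treat the exact spelling as an assumption.

    # Hypothetical ecFlow suite-definition snippet; the troika flags and the
    # %USER%/%HOST% suite variables are assumptions, not taken from the slides.
    edit ECF_JOB_CMD    "troika submit -u %USER% -o %ECF_JOBOUT% %HOST% %ECF_JOB%"
    edit ECF_KILL_CMD   "troika kill %HOST% %ECF_JOB%"
    edit ECF_STATUS_CMD "troika monitor %HOST% %ECF_JOB%"

Wired like this, ecFlow itself never needs to know which queuing system sits behind a given host; switching systems only changes the Troika configuration.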
[06:27.360 --> 06:32.040] And if you want to add some hooks, for instance to create directories before your job runs,
[06:32.040 --> 06:39.280] or to copy files over before or after submitting a job, et cetera, you can do that as well.
[06:39.280 --> 06:42.240] So as an example, this is how you would run Troika.
[06:42.240 --> 06:47.920] It has quite a simple command-line interface where you can control most of the flags you
[06:47.920 --> 06:52.320] will need in your day-to-day work.
[06:52.320 --> 06:57.000] You choose the action you want to perform: submit, monitor, or kill.
[06:57.000 --> 07:03.160] You give it a machine name, which is defined in the configuration,
[07:03.160 --> 07:08.480] and some options like the user, and you tell it where to write the output file, because that will
[07:08.480 --> 07:11.240] stay on the server.
[07:11.240 --> 07:17.240] It also serves as a reference: if you want to copy some other files, they will be put
[07:17.240 --> 07:18.720] alongside that one.
[07:18.720 --> 07:24.440] And here you can see the log below, which shows the commands that would actually be
[07:24.440 --> 07:27.800] executed when doing that.
[07:27.800 --> 07:31.840] So as I said, everything is configurable.
[07:31.840 --> 07:38.160] Each site has a name to identify it on the command line.
[07:38.160 --> 07:44.840] Then you define the connection type (local, SSH, whatever you want to add) and a queuing system type.
[07:44.840 --> 07:49.560] For now we support direct execution, Slurm, and PBS.
[07:49.560 --> 07:54.400] And then you can add some hooks, for instance: before I start doing anything, check
[07:54.400 --> 08:00.080] the connection just to see whether it will actually work; or before submitting
[08:00.080 --> 08:08.800] the script, make sure the directory containing the output file exists; or once
[08:08.800 --> 08:14.960] the job is submitted, copy the log file to the server so that we can see everything in
[08:14.960 --> 08:21.000] the same place rather than having files scattered across every system.
[08:21.000 --> 08:26.880] So that's all good, but just having an alias to sbatch that runs it remotely is not
[08:26.880 --> 08:27.880] really helpful.
[08:27.880 --> 08:35.200] We also need to modify the job script to add some options that are understood by
[08:35.200 --> 08:37.000] the submission system.
[08:37.000 --> 08:47.120] For that we decided to have our own directive language, because obviously the native directives are not interoperable
[08:47.120 --> 08:50.840] across submission systems,
[08:50.840 --> 08:54.520] and so we need some kind of translation.
[08:54.520 --> 09:02.720] We take generic directives as input, and we can add some in the configuration as well.
[09:02.720 --> 09:10.680] And then we translate them. Some of it is very simple: for example, the output
[09:10.680 --> 09:18.160] file option in PBS is -o, while in Slurm it's --output.
[09:18.160 --> 09:24.800] This kind of translation can also use plugins that compute resources: if someone
[09:24.800 --> 09:28.160] gives you the number of nodes and the number of tasks per node, and you need the total
[09:28.160 --> 09:34.480] number of tasks, things like that, you could add a plugin; or if you have some specific
[09:34.480 --> 09:39.960] resource management on your HPC system, you can add that as well.
[09:39.960 --> 09:44.640] And then on the output side, we have a generator that is site-specific, again because we need
[09:44.640 --> 09:48.280] to adapt the directives to the system.
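As a rough sketch of what such a site configuration could look like, assuming a YAML file with one entry per site; the key and hook names below are illustrative guesses based on what is described here and may not match the released Troika schema exactly.

    # Illustrative Troika-style site configuration; key names are assumptions.
    sites:
      localhost:
        type: direct                      # no queuing system, run directly
        connection: local
      my-hpc:
        type: slurm                       # render directives as #SBATCH, submit with sbatch
        connection: ssh
        host: hpc-login.example.org
        at_startup: ["check_connection"]  # fail early if the connection is broken
        pre_submit: ["create_output_dir"] # make sure the output directory exists
        at_exit: ["copy_submit_logfile"]  # bring the submission log back to the server

A submission along the lines of the example on the slide would then look roughly like this, with the exact flags assumed from the description above:

    troika -c troika.yml submit -u $USER -o /path/to/output/task.out my-hpc task.sh
    troika -c troika.yml monitor my-hpc task.sh
    troika -c troika.yml kill my-hpc task.sh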
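The resource-computation plugins mentioned just above can be pictured as small functions that fill in missing directives from the ones the user gave. A minimal Python sketch, assuming the directives are passed around as a plain dictionary; the real Troika plugin API and directive names may differ.

    # Illustrative sketch only; directive names and the hook signature are assumptions.
    def compute_total_tasks(directives):
        """Derive the total task count when only per-node numbers are given."""
        if ("total_tasks" not in directives
                and "nodes" in directives
                and "tasks_per_node" in directives):
            directives["total_tasks"] = (
                int(directives["nodes"]) * int(directives["tasks_per_node"])
            )
        return directives

For example, {"nodes": 4, "tasks_per_node": 128} would gain total_tasks = 512, which a Slurm generator could then render as "#SBATCH --ntasks=512" and a PBS generator as the corresponding resource request.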
[09:48.280 --> 09:53.120] It can make the last few translations, for instance the actual syntax of some options,
[09:53.120 --> 10:00.560] like mail options: most submission systems allow you to specify an email address to which
[10:00.560 --> 10:04.240] an email is sent for some of your tasks.
[10:04.240 --> 10:09.480] Only the syntax is slightly different for each of them, so it does that translation, and
[10:09.480 --> 10:18.240] it's able to add code if you need it, for instance to define environment variables for your software.
[10:18.240 --> 10:25.360] So the main components that are extensible in Troika are, as I said, first the interaction
[10:25.360 --> 10:27.200] with the queuing system.
[10:27.200 --> 10:35.440] There you have a parser that reads the native directives, so that you can use them if you
[10:35.440 --> 10:43.560] need them for your processing; it generates the job script and runs the appropriate commands,
[10:43.560 --> 10:55.480] so either qsub, sbatch, or whatever; it could use APIs if you have another system.
[10:55.480 --> 11:00.520] And it can also keep track of the submission, most of the time just by keeping a job ID,
[11:00.520 --> 11:06.240] so that if you want to monitor the task, you just say which script it was, and
[11:06.240 --> 11:11.760] Troika knows it put the job ID in that file next to the script;
[11:11.760 --> 11:16.400] you don't need to tell it where it is.
[11:16.400 --> 11:22.840] So you can choose how you want to interact and define new interfaces if you want.
[11:22.840 --> 11:24.200] Same for the connection.
[11:24.200 --> 11:30.440] The connection component mostly handles running commands on the remote system.
[11:30.440 --> 11:36.440] It's able to copy files over if needed, in both directions.
[11:36.440 --> 11:44.560] And you can have hooks at various points: at start-up, just before submitting, just after
[11:44.560 --> 11:48.480] killing a job, for instance if you want to tell the workflow manager that this task
[11:48.480 --> 11:54.560] doesn't exist anymore because it was just killed, or at exit if you want to move your log files
[11:54.560 --> 11:56.640] around, for instance.
[11:56.640 --> 12:01.000] That allows you to perform extra actions.
[12:01.000 --> 12:04.780] And then the last thing you can customize is the translation.
[12:04.780 --> 12:10.920] If you want to generate more directives than the user provided, you can do that too.
[12:10.920 --> 12:17.120] Basically, you just pass a function that takes the input set of directives and updates
[12:17.120 --> 12:21.560] that set to whatever you need.
[12:21.560 --> 12:27.240] As a bit of a success story for us: we've just switched to a new HPC system, with a new
[12:27.240 --> 12:33.120] set of ecFlow server VMs, a new location, new everything.
[12:33.120 --> 12:38.960] It's much simpler to just be able to change a config file rather than rewrite
[12:38.960 --> 12:43.840] a whole shell script that does all the submission for us.
[12:43.840 --> 12:48.160] And also, since we have lots of different users, they have different needs and
[12:48.160 --> 12:50.080] different ways of working.
[12:50.080 --> 12:55.960] What we managed to do with Troika is bring them all together to use
[12:55.960 --> 13:02.160] a single tool. It runs the operational workflows, where we need to have tight control
[13:02.160 --> 13:07.040] over what is actually submitted and all the options.
[13:07.040 --> 13:12.680] It runs research workflows, which need to be very flexible because every researcher might have
[13:12.680 --> 13:17.240] their own specific needs, but in the end they run mostly the same kind of code,
[13:17.240 --> 13:22.400] so we need an interface that allows that.
[13:22.400 --> 13:25.320] And then we also run general-purpose servers.
[13:25.320 --> 13:30.560] If someone has a data processing pipeline, for instance, they can just spawn a server
[13:30.560 --> 13:33.480] and do their work.
[13:33.480 --> 13:38.120] That needs an easy-to-use interface, because we don't want to have to tell people,
[13:38.120 --> 13:42.040] "you also need to know this to run your job."
[13:42.040 --> 13:47.640] So now what we do is provide them with VMs where Troika is pre-installed, and many
[13:47.640 --> 13:52.120] of them don't even notice that it's there.
[13:52.120 --> 13:58.520] As a summary: I said at the beginning that we handle about half a million jobs per
[13:58.520 --> 14:05.960] day, and most of them now pass through Troika, and it hasn't failed yet, so hopefully it
[14:05.960 --> 14:08.120] works well enough.
[14:08.120 --> 14:13.960] Going forward, it will also help us support our software development.
[14:13.960 --> 14:18.600] It's not necessarily tied to a workflow manager.
[14:18.600 --> 14:25.240] We also want to drive our CI/CD pipelines with it, because some elements of
[14:25.240 --> 14:28.680] those pipelines have to run on our HPC system.
[14:28.680 --> 14:34.120] So basically, from a GitHub runner, we could use Troika to connect to
[14:34.120 --> 14:41.600] our HPC system and run jobs there to do testing, deployment, and everything.
[14:41.600 --> 14:46.120] As I said, we run our in-house workflows with it, and we will continue to do that for the foreseeable
[14:46.120 --> 14:47.120] future.
[14:47.120 --> 14:53.560] It will help us adapt to new HPC systems, because every time we issue a tender, any provider
[14:53.560 --> 15:01.160] could answer, and we don't control which submission system we will end up with, or even which
[15:01.160 --> 15:06.600] site-specific variants there will be in the set of options.
[15:06.600 --> 15:12.920] And then for Destination Earth, as I mentioned before, we want to support multiple HPC systems with
[15:12.920 --> 15:18.120] minimal changes to the code.
[15:18.120 --> 15:25.040] And just to say a bit more about where we want to go from here:
[15:25.040 --> 15:33.400] we want to support more queuing systems, because for now we support two, and one of
[15:33.400 --> 15:38.520] them quite well because we use it, the other one maybe a bit less.
[15:38.520 --> 15:44.280] We also want to add functionality to query the submission systems, for instance
[15:44.280 --> 15:49.200] which queues and partitions are available, things like that, so that the user doesn't
[15:49.200 --> 15:58.120] need to go to the server and check before running; you could just run a command that fetches
[15:58.120 --> 16:07.040] all the information in a useful way and gives it to you, basically abstracting away
[16:07.040 --> 16:10.240] the specifics.
[16:10.240 --> 16:14.920] We also want to add some generic resource computation routines.
[16:14.920 --> 16:20.840] We have some in-house, but they are very tied to the way we work, and so there
[16:20.840 --> 16:26.440] will be some work to make them more generic and then integrate them into the main source code
[16:26.440 --> 16:30.080] rather than keeping them in a plugin.
[16:30.080 --> 16:35.240] As for improvements to the code, we want to improve the script generation.
[16:35.240 --> 16:38.400] For now it's a bit clunky, but it works.
[16:38.400 --> 16:45.400] We want to widen the test coverage, because you never test enough, and provide packages to
[16:45.400 --> 16:52.280] install it on Debian-based machines, for instance, or RPMs for Red Hat systems, et cetera.
[16:52.280 --> 16:57.520] And if you want to contribute, feel free to talk to me or go to our GitHub page. I'll
[16:57.520 --> 17:10.400] stop here for now and take questions.
[17:10.400 --> 17:18.400] Hello, thanks for the presentation.
[17:18.400 --> 17:24.440] I've basically done something quite similar for my employer; sadly it cannot be open-sourced.
[17:24.440 --> 17:30.280] But the problem we have is that we have legacy clusters with legacy job submission systems.
[17:30.280 --> 17:36.840] How did you manage to get the traction to migrate to Troika and to convince the users
[17:36.840 --> 17:43.000] to port their jobs and their developments to this new system?
[17:43.000 --> 17:47.160] What we did first is make it as seamless as possible.
[17:47.160 --> 17:52.240] If you want to interact with your job submission system without using our directives, you can.
[17:52.240 --> 17:58.800] They will just pass through, but you lose the genericity.
[17:58.800 --> 18:07.200] And then what helped us is that we changed our HPC system, and that meant we basically
[18:07.200 --> 18:12.160] started afresh and everyone had to make changes, so we just pushed that onto them.
[18:12.160 --> 18:18.840] And I must say many of them have been happy, because it meant we could do that for them
[18:18.840 --> 18:24.960] rather than them having to figure out the details of how to submit jobs on the
[18:24.960 --> 18:27.240] new system and everything.
[18:27.240 --> 18:30.640] We can just tell them: it's pre-installed, it works.
[18:30.640 --> 18:32.960] So yeah, that has been really helpful.
[18:32.960 --> 18:35.280] I actually have a follow-up question to that.
[18:35.280 --> 18:40.760] One thing we have been doing: we switched recently, well, four or five years ago, from
[18:40.760 --> 18:46.040] Torque to Slurm, and we didn't want to make all our users retrain themselves and learn
[18:46.040 --> 18:50.240] the Slurm commands, because in our experience Slurm is a bit less user-friendly than Torque
[18:50.240 --> 18:51.240] is.
[18:51.240 --> 18:55.120] So what we did is roll a wrapper so that people can still use qsub, but they're actually
[18:55.120 --> 18:59.720] submitting to Slurm; it just translates the script in the background.
[18:59.720 --> 19:01.160] Troika doesn't do that now, right?
[19:01.160 --> 19:05.920] You have to use the troika command, but it knows about the Slurm headers.
[19:05.920 --> 19:09.000] Yeah, so you could technically do it.
[19:09.000 --> 19:16.000] We didn't want to encourage that, but technically you could. I think you could write a
[19:16.000 --> 19:21.240] script in three lines, a plugin that just takes the directives.
[19:21.240 --> 19:26.560] You would probably need to support all the directives you need, but we have a built-in
[19:26.560 --> 19:31.360] parser that is able to read Slurm directives, for instance.
[19:31.360 --> 19:37.280] So you would just need to tell Troika to use those on top of whatever is specified
[19:37.280 --> 19:39.280] in the configuration.
[19:39.280 --> 19:41.600] Is that something you would take pull requests on?
[19:41.600 --> 19:43.600] Yeah, if you want to.
[19:43.600 --> 19:46.200] Okay, we had another question.
[19:46.200 --> 19:47.200] Yeah.
[19:47.200 --> 19:50.200] Pass it along.
[19:50.200 --> 19:57.840] Hi, thank you for the presentation, very interesting.
[19:57.840 --> 20:05.200] I'm a programmer myself, and so my question for you is: how does it fail?
[20:05.200 --> 20:11.200] Have you studied or provoked intentional failures of the system, and have
[20:11.200 --> 20:18.280] you encountered funny behaviors, or plain hilarious faults of the system?
[20:18.280 --> 20:24.560] Yeah, we had some; I mean, getting a new system comes with its share of failures. I don't know if
[20:24.560 --> 20:30.760] you want to take over on that, Axel, because you have probably handled some of the failures.
[20:30.760 --> 20:37.040] In the command-line example provided, you can see that we redirect the output of
[20:37.040 --> 20:43.440] each submission, and this is a chance to analyze the submission and decide on the best
[20:43.440 --> 20:48.880] approach to deal with erroneous submissions. Some of them have to be reported
[20:48.880 --> 20:52.480] the hard way, to make it clearly visible that this is a problem.
[20:52.480 --> 20:58.640] And others can be handled in a hidden, or less visible, way, in a still deterministic
[20:58.640 --> 21:05.360] manner, so that the problems are still handled automatically when they occur.
[21:05.360 --> 21:11.240] And with so many jobs to submit, that is what we want: to keep the human side focused on what is critical and essential,
[21:11.240 --> 21:17.040] and to have a chance to teach the machines, through the hook system,
[21:17.040 --> 21:22.240] to deal with the specificities we have identified as problematic but want to keep ignored
[21:22.240 --> 21:27.520] or managed automatically until a fix comes from the queuing system, for example if it
[21:27.520 --> 21:36.960] is related to a queuing system problem or to identified issues that may be fixed in the next release.
[21:36.960 --> 21:45.120] So this is a way to deal with the failures that can occur at job submission.
[21:45.120 --> 21:53.480] Thank you.
[21:53.480 --> 22:00.320] Did I understand correctly that when you're monitoring a job, the reference is the script?
[22:00.320 --> 22:01.320] Yes.
[22:01.320 --> 22:02.320] Correct.
[22:02.320 --> 22:08.040] So that means everyone has to make sure their scripts are uniquely named each time, otherwise...
[22:08.040 --> 22:12.080] Or is it based on where the script is in the file system?
[22:12.080 --> 22:14.480] It's where it is in the file system.
[22:14.480 --> 22:16.040] So you are correct.
[22:16.040 --> 22:21.440] If someone deletes or renames their script, then it can cause a problem.
[22:21.440 --> 22:24.680] Or submits again with the same script.
[22:24.680 --> 22:31.840] It's not a problem for us, because our workflow manager basically does some pre-processing,
[22:31.840 --> 22:39.120] meaning that the script gets some additions, like: this is your second try
[22:39.120 --> 22:44.960] at that submission, so I will add .job2 at the end.
[22:44.960 --> 22:50.520] That's how we circumvent this issue, but you are definitely correct, and that's
[22:50.520 --> 22:53.920] something we will need to improve at some point.
[22:53.920 --> 23:01.760] But we didn't want to have to link to a database or something, so that we can keep it simple.
[23:01.760 --> 23:03.120] Thanks.
[23:03.120 --> 23:05.760] You could just copy the script on submission, no?
[23:05.760 --> 23:06.760] We could copy it.
[23:06.760 --> 23:13.080] It's just that, with half a million scripts per day, we need
[23:13.080 --> 23:14.920] to think about cleanup.
[23:14.920 --> 23:15.920] Yeah.
[23:15.920 --> 23:20.920] Other questions?
[23:20.920 --> 23:23.720] Hello.
[23:23.720 --> 23:27.040] Users like things to be as simple as possible.
[23:27.040 --> 23:31.840] For that, it would probably be nice to have some sort of central location
[23:31.840 --> 23:36.880] where recipes for various clusters would be collected and made accessible for people
[23:36.880 --> 23:38.080] to use.
[23:38.080 --> 23:41.200] Is that in your plans?
[23:41.200 --> 23:43.000] What do you mean, on the configuration side?
[23:43.000 --> 23:48.120] I could imagine a user turning up and going: I'm going to download Troika, and I'm
[23:48.120 --> 23:50.400] going to talk to this cluster that I have access to.
[23:50.400 --> 23:52.000] How do I get the configuration?
[23:52.000 --> 23:53.000] Oh, OK.
[23:53.000 --> 23:54.360] I see.
[23:54.360 --> 24:00.360] We don't have that, but if Troika gets traction, I think we could come up with a
[24:00.360 --> 24:06.760] website where you can host your configuration files, or have some kind of index where you
[24:06.760 --> 24:08.560] can list them.
[24:08.560 --> 24:16.520] I think we have all that's needed to do that pretty easily.
[24:16.520 --> 24:23.240] Hopefully, the configuration is easy enough that you don't need to do much
[24:23.240 --> 24:26.800] on top of what's actually provided as examples.
[24:26.800 --> 24:28.040] But yeah, you are correct.
[24:28.040 --> 24:35.280] If it gets popular, we could just provide configuration files for several systems, or
[24:35.280 --> 24:44.520] HPC system providers could also just ship a configuration file with the system, so
[24:44.520 --> 24:48.040] that it's there wherever Troika is installed and the user doesn't even need to bother
[24:48.040 --> 24:53.520] with it.
[24:53.520 --> 24:54.520] A very small second one.
[24:54.520 --> 25:00.080] Given you've just done all this, have you heard of a project called DRMAA, the Distributed
[25:00.080 --> 25:06.240] Resource Management Application API? It might make the insides of this slightly nicer for
[25:06.240 --> 25:15.040] your ecFlow stuff; maybe you could take some inspiration from that.
[25:15.040 --> 25:19.560] Thank you.
[25:19.560 --> 25:21.200] A question, but also an observation.
[25:21.200 --> 25:25.720] A long time ago there was a standard called DRMAA; it was an API.
[25:25.720 --> 25:26.720] It was just mentioned.
[25:26.720 --> 25:32.160] It seems not to be used anymore, maybe I'm wrong. But very quickly: your system, if you had
[25:32.160 --> 25:36.200] cloud-based resources on AWS... you've got an SSH connector.
[25:36.200 --> 25:40.480] Could you, in the future, maybe spin up some machines on AWS?
[25:40.480 --> 25:42.560] Yeah, that could be an option.
[25:42.560 --> 25:50.440] As long as you can write Python code to spin up an image or a container somewhere.
[25:50.440 --> 25:51.840] Yeah, sure.
[25:51.840 --> 25:57.760] I think the API is there for that; it just needs to be a plug-in that does the connection,
[25:57.760 --> 25:58.760] and that's it.
[25:58.760 --> 25:59.760] Cool.
[25:59.760 --> 26:02.320] Okay, we're out of time.
[26:02.320 --> 26:03.320] Just a comment.
[26:03.320 --> 26:07.240] I don't think you have any people using Troika outside of ECMWF.
[26:07.240 --> 26:09.760] No, this is the first time we've actually presented it outside.
[26:09.760 --> 26:10.760] All right.
[26:10.760 --> 26:11.760] Good.
[26:11.760 --> 26:14.440] So you're trying to start, or trying to get people to start using it?
[26:14.440 --> 26:15.440] Yes.
[26:15.440 --> 26:17.440] You're building a community; you're getting yourself into trouble.
[26:17.440 --> 26:22.720] We're going to get pull requests and bug reports, but okay.
[26:22.720 --> 26:23.720] Thank you very much.
[26:23.720 --> 26:24.720] Thank you.
[26:24.720 --> 26:25.720] Very nice.
[26:25.720 --> 26:26.720] Thank you.
[26:26.720 --> 26:27.720] I'll just switch.
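Coming back to the cloud question at the end of the session: a purely illustrative sketch of the shape such a connection plug-in could take, assuming machines that are reachable over SSH once they are up. This is not the actual Troika plugin interface; the class and method names are made up for illustration.

    # Purely illustrative; not the real Troika connection API.
    import subprocess

    class CloudSSHConnection:
        """Run commands and copy files on a cloud instance reachable over SSH."""

        def __init__(self, host, user):
            self.target = f"{user}@{host}"

        def execute(self, command):
            # Run a command on the remote instance and return its exit status.
            return subprocess.run(["ssh", self.target, command]).returncode

        def sendfile(self, local_path, remote_path):
            # Copy a local file (e.g. the generated job script) to the instance.
            return subprocess.run(
                ["scp", local_path, f"{self.target}:{remote_path}"]
            ).returncode

Spinning the instance up and tearing it down would be extra code around this (for instance via a cloud provider's SDK), which is exactly the kind of thing the answer above suggests putting into a plug-in.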