Okay, so good afternoon, or good evening I suppose. I'm Jamie, a doctoral student at TU Dortmund University in Germany and a member of the LHCb experiment at the Large Hadron Collider. More importantly for this talk, I've been a user of Snakemake for three years, and I'll come on to what Snakemake is and how it's used within our experiment.

To jump straight into the title, the first two words, workflow managers: they do what they say on the tin. They are tools to manage workflows, which is unsurprising from the name. It's not uncommon in many fields, but particularly in high energy physics, that we have data of some variety, we want to apply some processing, some form of workflow, to get some results and ideally a nice paper out of things. From this we can structure our work with a workflow manager. This involves defining our workflow; organizing or reorganizing the rules in our workflow, this kind of flow chart we have; running a workflow, or rerunning it when there's a change to the code base or a result needs to be reproduced; and also documenting a workflow. We've heard a bit throughout the day about reproducibility, so it plays into that too. There are a lot of tools on the market. Here is just a selection from a few different places, some leaning more towards the tool side of things, some more towards the framework side, things like the Common Workflow Language. The focus today is really going to be on Snakemake.

Snakemake evolved from the GNU Make paradigm, so the workflow is defined as a set of rules. These can be related to one another, and a directed acyclic graph is generated, which can really just be thought of as a flow chart linking all of the rules together to get a set of required target products or results. We can use wildcards within the rules, which really allows us to create dynamic workflows. I thoroughly encourage having a play around with it; it's a really fun tool to use, a really good tool to use. It's based on Python, so it's Python compatible; you can in fact use Python functions within Snakemake. This provides a very shallow learning curve for those already experienced with Python. It's been in development for a while, and it recently had an overhaul of sorts with version 8, which was released at the end of last year and restructured a lot of the functionality to look more towards the future. It was originally a big tool in bioinformatics, but within the last five years it's been picked up in HEP.

Now, I've said a lot about HEP, high energy physics, so it's probably worth briefly touching on what that means. This really is the physics of the early universe; it's what we talk about when we talk about particle physics. For example, at the Large Hadron Collider, where the experiment I work on is based, we accelerate particles to large fractions of the speed of light and then collide them, to see in these very high temperature, high density environments what the conditions of the early universe would have looked like. There are four main experiments around this ring, which is 27 kilometres of tunnel under the Swiss-French border at Geneva. My experiment, LHCb, is just in the foreground here, next to the airport, and its job is essentially looking at the differences between matter and antimatter. This is our little experiment, 100 metres under Geneva.
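To make that concrete, here is a minimal sketch of the kind of Snakefile this describes. The sample names and the scripts are purely illustrative assumptions, not taken from the talk.

```python
# Minimal illustrative Snakefile; sample names and scripts are hypothetical.
SAMPLES = ["2018_up", "2018_down"]

rule all:
    # The requested target products; Snakemake builds the DAG needed to make them.
    input:
        expand("plots/{sample}_mass.pdf", sample=SAMPLES)

rule select:
    # {sample} is a wildcard, so this single rule covers every sample.
    input:
        "data/{sample}.root"
    output:
        "selected/{sample}.root"
    shell:
        "python select.py --input {input} --output {output}"

rule plot:
    input:
        "selected/{sample}.root"
    output:
        "plots/{sample}_mass.pdf"
    shell:
        "python plot_mass.py {input} {output}"
```

Running snakemake --cores 4 resolves the DAG backwards from rule all and only reruns the steps that are out of date.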
Now, analyses in high energy physics aim to measure something, so an analysis is a project running all the way through to a final measurement; that might be the mass of a particle, the lifetime of a particle, or how it decays, for example. Ideally we want to contradict the Standard Model, which is our best understanding of particle physics: any contradiction here would be a sign that something is amiss, that there is new physics out there. We start off with experimental data and extract a measurement from it by fitting, selecting and so on. There are a number of processes we can apply, and these all take place in dedicated scripts, usually written by the analysts. Since analyses are collaborative projects, ranging from a few authors up to dozens in some of the larger analyses, every analysis has a large, shared, and dynamic code base. This may sound familiar: it's a software project in many ways, and parallels can be drawn there.

In terms of what we require from a workflow manager in high energy physics, we can break it down like this. To start with, results need to be reproducible. We can't really afford to build a second Large Hadron Collider, so the best we can do is ensure that our results from the raw data make sense if we rerun the analysis. Similarly, if there's a new theoretical input or a change to the scripts, we may need to rerun the analysis, and workflow managers make this very easy. The data we have is often stored remotely, because we have absolutely tons of it. The scales are enormous, several terabytes per analysis, with hundreds of analyses per experiment ongoing at once, and this will only get larger; there's a plot I'll show near the end that shows how daunting this is, but we'll come onto that. The scripts in an analysis, as I've said, can change very frequently and can also be of very different types: Python, C++, Fortran, it really can range. So you want a flexible platform where all you need to do is ensure that a step runs in the shell, and you can then deploy it within a workflow. And finally, we need to be very scalable and deployable, both with regards to the amount of data and the number of authors, ensuring that everyone collaborating on an analysis can contribute to it.

Snakemake meets all of these needs, and meets them very well. It has seen a lot of uptake, particularly in my experiment, LHCb, to the point where we have a really strong user base and our internal expertise has started leading to internal training. I've linked here a very recent training from a colleague; it took place yesterday. There's also a Snakemake course within our Starterkit, which is given to all of our new members, so new analysts are trained in this tool and can use it right from the off, and since most of them are already familiar with Python, that really is getting started from day one, getting straight into physics. The features and functionality suit analyses really, really well. There are interfaces for HPC resources: the amount of data we have means we can't process all of it locally, so we want to make use of the resources we have around the world as well as local clusters.
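As an illustration of that flexibility, a rule only needs a shell command, so compiled and interpreted steps can sit side by side in one workflow. The binary and script names below are hypothetical.

```python
# Hypothetical mixed-language steps: anything that runs in the shell can be a rule.
rule mass_fit:
    input:
        "selected/data.root"
    output:
        "fits/mass_fit.json"
    shell:
        # A compiled C++ executable built elsewhere in the repository.
        "./build/run_mass_fit {input} {output}"

rule plot_fit:
    input:
        "fits/mass_fit.json"
    output:
        "plots/mass_fit.pdf"
    shell:
        "python plot_fit.py --fit {input} --out {output}"
```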
And because all of the data is remote, the integration of remote access protocols within Snakemake enables that in a much more user friendly manner.

Starting with scalable, deployable workflows, there are a few different ways Snakemake approaches this. We can break a larger workflow down into smaller files, either as wrappers of common snippets of code or as individual files. This also helps with maintainability and ensures that one small change can't destroy the whole workflow. Additionally, checkpoints within workflows allow for much more flexible definitions: if a script is going to produce an unknown number of files, so it's not deterministic from the start how the workflow will look, we can re-evaluate that flow chart further down the line, once those files have been produced and we know how many there are. We can also batch the jobs we have: if a rule runs many times, say over many files, we can batch it so we only need to consider so many at a time, which also reduces the local overhead when running. Lastly, in terms of deployability, there's integration for specifying package requirements with Conda, so that when the workflow is run on another machine or by another user it runs as it does, or should do, locally.

In terms of distributed computing, a large data scale requires a large computing scale, and it's not uncommon within analyses, even without workflow managers, to use clusters and HPC resources for processing and fitting; this is becoming more and more common with tools like RDataFrame, which is able to run through Spark. Snakemake supports the common interfaces, some of which are listed on the right hand side, and the implementation is very straightforward: the workflow is defined exactly as it is locally, and the only additional part is specifying a profile, typically a job script and a submission script. Then on the command line just an additional flag specifying the profile is given; everything else runs as it would locally, except that the jobs are submitted to the cluster. Resource limits can be set, so if you have a job that you're worried is going to need quite a lot of memory, or you're sharing a small pool of resources or using central resources, that helps in a lot of ways. And if you have a job that, say, writes a very small text file, where the overhead of submitting it to central resources isn't justified, it can be marked as a local rule so that it always runs locally, regardless of whether the rest runs locally or through a cluster.

Finally, in terms of functionality, most of our data is stored either on EOS, which is at CERN, or on the Worldwide LHC Computing Grid, which is spread around the world. The remote implementation within Snakemake allows for different providers. We mostly use XRootD, the protocol used by EOS and the WLCG, but S3 is becoming a bit more common in places and is also supported. In both cases there's a common implementation where you simply add provider.remote wrapped around your path, so you define the rule as you would normally and add that additional part on there.
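A rough sketch of how those deployment pieces can look in a Snakefile; the environment file, resource numbers, and rule names are all assumptions for illustration.

```python
# Illustrative sketch: environment file, resources, and rule names are hypothetical.

# Rules too light to justify a cluster job always run on the submitting machine.
localrules: summarize

rule heavy_fit:
    input:
        "selected/{sample}.root"
    output:
        "fits/{sample}.json"
    # Per-rule Conda environment, recreated the same way on any machine.
    conda:
        "envs/fitting.yaml"
    # Resource requests passed through to the cluster by the chosen profile.
    resources:
        mem_mb=8000,
        runtime=120
    shell:
        "python fit.py {input} {output}"

rule summarize:
    input:
        expand("fits/{sample}.json", sample=["2018_up", "2018_down"])
    output:
        "summary.txt"
    shell:
        "python summarize.py {input} > {output}"
```

Locally this would be run with something like snakemake --use-conda --cores 4; on a batch system the same workflow runs with snakemake --profile followed by the profile name, where the profile directory holds the submission settings.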
And there's some extra functionality in there: glob_wildcards, if you want to run a glob over these remote resources rather than having to download everything in advance, and keep_local, to avoid repeatedly downloading or streaming files. The list of providers isn't exhaustive here; I think there are about eight or nine, possibly more now, and one of the big changes with v8 was to increase the flexibility of that interface.

Finally, what do analysts need going forward? We have a tool that works very well and that we want to deploy more widely, and there are lots of discussions within the community as to how we do that. Going forwards, scalability is probably our biggest issue. This plot is from the ATLAS experiment; we are where the cursor is. In five years' time we will have five or six times more data on top of the data we already have, and this will keep going: the high-luminosity era, which is what this large increase from about 2028 is, is where we take enormous amounts of data compared to what we currently have. We need to be able to handle this, and we need the tools to handle it. The experiments are already large: LHCb used to be the smallest of the experiments, and technically I think it still is, sitting at around a thousand authors, and this grows by about a hundred every year, so we need to be able to deploy for that many authors across analyses. In terms of usability, analysts typically aren't software developers by trade, so having a tool that is very user friendly is super helpful, and having these implementations that work very well out of the box adds to that, so going forward it's about continuing the good work there. And in terms of functionality, what would really help is further collaboration, and there's already been quite a bit between the developers and HEP users, for example the pull request to add Ganga, a service we use in LHCb as a cluster interface, as an executor in the same format that was shown a couple of slides before.

To draw some conclusions: workflow managers really are very useful to our research. They save us a lot of time by making sure that everything is run in the right order, that nothing is left behind, and that we can look back and see what's been run. The tools do meet our needs, the functionality is there, and in fact the use of these tools will become unavoidable: within the next few years the scales of data we have will mean we have to use tools like Snakemake. And finally, we have a very strong user base, though it's a bit spread across the experiments and not necessarily all on Snakemake; CMS use Luigi, which was on a much earlier slide. Bringing that together, to collaborate both across the HEP community and with the developers, is also crucial in moving this forward. I've left a few links on there, including the training from earlier, and I would otherwise welcome any questions.

One question I have for you: as a happy user of Snakemake, one thing that is sometimes challenging is the visualization of the DAG and the execution flow. How do you handle this, especially when you have thousands of nodes in the graph? So, to repeat the question: within Snakemake there are ways to visualize the workflow, but this can be very difficult for large workflows, and the question is how to approach that. Often this can be done with grouping, which helps in a lot of cases; for very large rules, when visualizing the DAG, this narrows everything down.
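As a rough idea of what that remote pattern looks like, here is a sketch using the pre-v8 remote-provider interface (v8 has since moved this towards storage plugins); the EOS path and the selection script are purely illustrative.

```python
# Sketch of remote inputs via XRootD (pre-v8 interface); paths are hypothetical.
from snakemake.remote.XRootD import RemoteProvider as XRootDRemoteProvider

xrootd = XRootDRemoteProvider()

# Discover which samples exist on EOS without downloading anything first.
SAMPLES, = xrootd.glob_wildcards(
    "root://eoslhcb.cern.ch//eos/lhcb/user/j/jamie/ntuples/{sample}.root"
)

rule all:
    input:
        expand("selected/{sample}.root", sample=SAMPLES)

rule select:
    input:
        # keep_local caches the file so it isn't streamed or downloaded repeatedly.
        xrootd.remote(
            "root://eoslhcb.cern.ch//eos/lhcb/user/j/jamie/ntuples/{sample}.root",
            keep_local=True,
        )
    output:
        "selected/{sample}.root"
    shell:
        "python select.py {input} {output}"
```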
I believe that applies to both the DAG and rule-graph modes; one expands to every job and one just goes to individual rules, so one is naturally a lot narrower. Really, between batching and grouping, those have been the ways I've personally dealt with it; I'm not sure about elsewhere in the user base, but that's how I've gone about it. Sorry, I think we have to move on to the next talk, so thank you very much for joining me.
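For context, the per-job view comes from snakemake --dag piped through Graphviz dot, while snakemake --rulegraph collapses it to one node per rule, which is usually far more readable for large workflows. The grouping mentioned in the answer is set per rule; a hypothetical illustration, with made-up rule and group names:

```python
# Hypothetical rule: jobs sharing a group label are bundled into one group job,
# which (per the answer above) also helps condense what gets drawn.
rule select:
    input:
        "data/{sample}.root"
    output:
        "selected/{sample}.root"
    group:
        "preprocessing"
    shell:
        "python select.py {input} {output}"
```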