The Institute which tries to promote best practice in research software across the UK.
And earlier this year we worked on this project called Cats, the Climate Aware Task Scheduler
that we'd like to talk to you about today.
So very simply the idea behind Cats is to try and time shift when we do our compute
to the times when the carbon emissions of producing electricity are at their lowest.
We are probably aiming this at smaller to mid-size HPC and HTC systems that are not
100% loaded and therefore we have that flexibility to time shift.
If you've got a super busy system that's always at 100% then there's not much you
can achieve by time shifting your compute.
I'm sure that most of you are familiar with the very pressing need as to why this is important
but carbon dioxide levels are now getting higher than we can tell they have ever been
for the last 800,000 years.
Before that our records are not so clear because we don't have things like ice cores
readily available but it looks like this is very, very much human caused and because
of burning fossil fuels and this is causing a rather dramatic and alarming increase in
temperature.
Very worryingly in 2023 we saw temperatures going dramatically off the scale and with
a far bigger jump than we had seen in any prior year.
We get to see if that trend is going to repeat last year but I don't know how many of you
have been to Fozden before but for me this is the warmest Fozden I've ever been to and
I think this is my fifth one.
Most specifically in activation.
Have I gone?
No, that's going.
Is that working?
Yes.
We don't want to set the world on fire from doing our computational work but that work
is important and we would like to do it but having the most minimal impact we can while
we do so.
So our plan is, as I said, to time shift our compute to when electricity has the lower
carbon intensity.
We focus this very much on the UK because that's where at least most of us were based
and where we met to come up with this idea.
The UK has a very, very variable level of carbon intensity in its electricity.
Some other countries are not so variable but we have quite a lot of wind power now and
quite a bit of solar but it's always windy and it's not always sunny.
Carbon intensity can actually vary from in some regions as low as zero grams of carbon
dioxide per kilowatt hour to about 400.
The average across the whole of the EU in 2022 was about 250.
And we have some huge regional variations within the UK.
Scotland is normally very green and very low carbon because it's got a lot of wind power,
a lot of hydro, not that many people demanding it, not that much industry compared to England.
And although it is interconnected into England, those interconnected are of limited capacity
so all of that electricity can actually be made to other parts of the country.
Conversely, the south of England has a very high population density and is very dependent
on gas power still.
It does have a bit of solar power because it's the sunniest part of the country and
a little bit of wind but not so much as Scotland by any means.
As we do have this relations connection, we have a lot of international interconnections
now.
A new one just came online to Denmark.
Another new one came online to Norway a couple of years ago and an additional one to France.
We've had another one to France since the 1980s and there's also interconnected Belgium,
the Netherlands and Ireland.
But again, those represent maybe 30% of typical generation capacity.
Unfortunately in the UK we also have this great web API called carbonintensity.org.uk
that provides us with regionalised forecasts for the next 48 hours with 30 minute time
windows.
And it has both JSON and XML APIs that will allow us to interrogate this data and very
easily get hold of what our regional forecasts are and it will show us how things are performing
both against the forecast and previous measurements.
So just to give an example of some of the data we get out of this site, this is an example
of a really good day in October last year where the whole country was green, which I
think means under 75, no, maybe 100 grams of CO2 per kilowatt hour and about half the
country is dark green, which means I think under 35.
And we can see some of these regional variations.
So just comparing two regions here, these are two regions where my employer has offices.
So in the north Wales and Merseyside region, which is this one here, we had only grams
of CO2 per kilowatt hour.
In the south of England we had 92, which still for the south of England is very low.
But if you look at it later, the situation had changed drastically and we now have 289
grams in the south and 235 in north Wales.
So if we could have made sure our compute jobs ran on the 6th instead of the 9th, we
could have reduced the amount of carbon being put into the atmosphere to run those jobs
quite considerably.
So just thinking about how much this might actually save us, imagine we have a fictional
HPC or part of an HPC, which has 64 core AMD epic CPUs.
There's 10 nodes, a 1280 cores in total.
And we reckon each one of those will use about 255 watts fully loaded or 37.5 idle.
So if we can bring that node down from full to idle, assuming we don't turn it off or
suspend it or anything clever like that, then we're looking at around 217 watt per CPU saving.
If we can timeshift from when the grid would be at 200 grams of CO2 per kilowatt hour to
50, that's 150.
If we had a 12 hour job that used all those 1280 cores, that's around 50 kilowatt hours,
which equates to about 7.5 kilograms of carbon dioxide, which is the same as driving a car
50 kilometers.
So how many of us might consider not driving to work or not driving a 50 kilometer route
because it might reduce the environmental impact?
How many of our employers might also have a policy that says you shouldn't drive those
kind of distances, you should take public transport?
Well, why shouldn't we have the same policies for compute?
If we can achieve similar savings, we should be doing that for our compute systems as well
as for our travel.
I'd just think about this across the wider world.
There was a paper published in, I think, 2021, looking at the potential savings of doing
time shifting for AI based jobs in cloud providers, just showing us the different levels of savings
that could be achieved in different regions.
So as you see, these vary quite dramatically from region to region, normally depending on
how much renewable energy is available in that region and how bad the alternative is
when they are not using renewable energy as well.
Now, a lot of people might suggest, well, the grid is going to go net zero soon anyway.
Is it really important to do this?
And the UK has put out a set of future scenarios about how they might achieve their net zero
transition.
One of those scenarios, though, is where we don't actually achieve it.
And it's not quite clear whether we're actually on that pathway or one of the more optimistic
ones at the moment.
But even if we do achieve this, and their target is by 2035, we should actually be near
net zero, that's still another 11 years of us putting carbon dioxide into the atmosphere
when we use electricity, when we could reduce that impact now if we do something with time
shifting.
So let's do something now instead of waiting 11 years for someone else to solve the problem
and possibly not solve the problem in that time.
There is also a financial incentive to do this.
There are starting to be variable rate electricity tariffs that roughly reflect the carbon intensity
because wind and solar power don't have any fuel costs, so they're much cheaper to offer.
And if you've got your own electricity production like rooftop solar or your own wind farm, then
there's also a saving to use that power, and it's normally better to use it than to export
it and be paid normally quite poor rates for that export.
So I'm now going to hand over to Appalachic who will talk more specifically about CATS
and what it does.
Checking, okay, seems to be working.
Thanks Colin.
So I've been introducing the climate error task schedule, and so CATS figures out the
best time to start your HPC job, or really a job that you want to run in a laptop.
The users need to submit run times, of course they need to have some idea of how long the
job will take, and a postcode.
So this graph shows for example that if you schedule a job to the optimal time, you can
save your carbon footprint by 70%.
So right now it's like a proof of concept, we haven't released version one yet.
It was built in one day at these SSIs collaboration workshops.
It's kind of a hackathon that we have every year.
It won the first prize.
So what it is is a Python script, and it targets the at scheduler, so the at scheduler on Linux
and BSDs, schedules the starting time of a job.
So if you want a thing to start at 6 in the morning, it's at 0600, and it will start the
job then.
So that's a very easy, simple way to integrate CATS into this scheduling system.
So that's what we started with.
So the limitation is of course that if your HPC is running at full load all the time,
then it doesn't matter what you shift around because it will always be 100%.
So it's really meant more for computing systems that are not on all the time, so time shifting
will actually make a difference.
Of course you'll need to know how long the job will take.
Without that, this doesn't work at the moment.
If it's an HPC, then other users might be trying to run at the same moment, so how do
you do that sort of resuscitation?
Currently, it only works in the UK because we're using the Carbon Intensity UK API.
Of course, we open to pull requests for other APIs.
If other countries have energy data, energy forecasting data, so far we have not found
any free to use APIs.
So if you know, please let us know.
And of course, this is not the only thing you should do.
You can use cooling or there's a lot of carbon, embodied carbon in the manufacturing cost of
the server that things are running on.
So that's probably why you have HPC refreshers because newer servers are much more efficient.
So there's a lot of considerations, but in terms of what you as a user can do, I think
this is a good place to start.
So the way to use CATS is, it's a Python script, so you run it as a module, give it a job duration,
give it a postcode.
The postcode is a proxy for the location.
And using this, it fetches data from the Carbon Intensity API and it calculates the optimal
starting time for your job.
So we return data in a format that can be passed to AT and it also gives additional data in
JSON.
So what the other feature of CATS is that it will try to inform the user about the carbon
savings that take place when they're offsetting the time.
So we do some estimates on how much carbon would be saved by using some year after specify
configuration using like what kind of CPU or GPU you have, kind of power use efficiency
and thermal design power.
And using that, we have a simple formula that will calculate the amount of carbon that you
would have saved by running it in this time that CATS wants you to run it rather than
will right now.
So this is a demo of CATS running.
So we specify duration of 60 minutes.
And this is RT1, which is redding.
And it gives a time on 16th May, 1130.
And it also shows you the amount of carbon saved.
And that scheduled, so this is a last bit, you can run what you want as long as it takes
around 60 minutes.
Okay, that's it.
So if you have any questions, please email us or put up an issue in our GitHub.
And I skip the slide.
So what we're doing is we are going to release vision on this month.
We're cleaning the common line options.
So right now we only support AT, but we also want to support SBAT, which is the Slurm command
scheduler similar to AT.
And so we'll clean on the command line options and we'll release a vision.
But of course, going forward, we want to integrate with Slurm, which is the main scheduler for
ACs.
And the simplest method, of course, is SBAT, so the start time, the other option is green
cues where you have cues that don't run 100% because again, with running 100%, you don't
get the benefits of cats.
So we have green cues which don't run 100% and also integrating carbon accounting.
So Slurm has a plugin that allows you to look at the total power used in your job.
So way to integrate carbon accounting into that would be great.
That would require rewrites in C because Slurm plugins are in C.
And we have got funding from the Software Sustainability Institute for a few months of developer time.
So we're looking forward to making great progress and have something more polished soon.
And yeah.
And now, thank you.
Thank you.
Thank you, Kavain and Abish.
We have time for some questions.
Yes.
Thank you for your last presentation.
But my question is how often clusters are not load on 100%?
Yes.
I mean, that's a very good question.
So I think we need to think about how these things will work moving forward.
So we are talking with HBCs and I think this is more proof of concept and prototype to
look at how these carbon footprints might be achieved.
Obviously, if you do keep things 100% on time, then there is no point.
But I think if we do move to a future where we don't move to a net zero grid soon, then
we might see funders asking for carbon budgets or government putting carbon budgets just like
your financial budget.
And then you do want to see this.
So it's more like playing the groundwork for what might come next and not for the current
generation of HBCs.
Any more questions?
Sorry.
I wasn't looking over there.
Thank you very much.
Couldn't you use also energy prices like energy stock prices for forecasting?
Are they usually at least in Germany or lower when there's more renewable energy in the
net?
Thanks.
Yeah, that's a good point that we'll look into.
A good place to start with is to correlate, I guess, energy prices.
Yeah, yeah, we'll do the care and look at carbon intensity.
Thank you.
Hi.
I don't want to be Dell specific, but there's a web based tool from Dell called EIPT, Enterprise
Infrastructure Planning Tool.
You can configure servers with your particular build and you get their CO2 and power usage
for various workloads, computational memory intensive and idle.
And also Dell have a plugin for their IDRAX called Power Manager.
You can take a rack or a group of workstations servers and turn down the energy capping dynamically.
And that also works for Fujitsu and HPE servers for other brands.
I wonder if you would like to extend your work into that rather than scheduling the
jobs to turn down the power cap dynamically.
That'd be quite possible.
Thanks.
Yeah, I was at a word that there was a carbon accounting plugin for Slurm already.
So yes, ideally we have something that goes into Slurm and does the carbon accounting
rather than having user-shadow jobs.
That's definitely not optimal.
So yeah, thanks.
We'll look into that.
Any more questions?
Sorry.
Are you also planning to integrate this into HD Condor?
Sorry, could you repeat?
Are you also planning to integrate this into HD Condor?
It's another scheduling software?
Not at the moment.
At the moment we are focusing on Slurm, but once we have got something there we'll focus
on other schedulers.
Okay, thank you.
Hi there.
First of all, love it.
So thanks very much for the talk.
I'm looking forward to version one being released just to throw another thing out there that
you could do.
It'd be cool to have two locations and you can imagine an office having two offices in
the UK and they're kind of deciding where to deploy that task.
Maybe that extends even globally and you can choose where to run your task.
Yeah, spatial shifting would be a nice thing to have.
The problem with that is moving the data with it.
You need to have the data in both locations or move the data in advance of the job starting.
Anyone else?
That's one over there.
This is really just a comment.
I made a calculation of how much commuting to work costs in terms of energy compared
to what the supercomputers in that computing center are.
Just to give you a perspective, usually when you go to work and go back home it's like
the 100% that you can save here.
So just to give in general to whenever you're doing such savings estimates, just be aware
that tell your employer that just by working from home you can save that exact amount.
Sorry, it's just a statement.
Just a quick question on how CATS works.
I don't recall if you got into detail about it, but does it use the API data for time
series prediction or does it have some kind of internal calculation to do that?
It tries to find the lowest carbon window in the forecast period.
So if your job is going to be six hours it will try and find the lowest six hours in
that period and schedule your job to that time period.
Okay, but the predictions are already available.
The predictions are not made by us.
The predictions are available from the carbon intensity API.
Thanks for the talk.
I have one question.
If you already saw an API which is called Wattime API, I think it does the same, but
all over the world.
I wasn't aware of that one, but that is very useful.
We will look into that one.
Hi there.
One, two.
You can hear me, right?
Yeah.
Hi, I'm Chris Adams from the Green Software Foundation.
I'm curious about whether when you're doing this, if I know what jobs I did last year,
is there a way for me to run any of this against, say, last year's worth of compute jobs to
see what my savings could have been if I were to find something like this?
Because a lot of this is forward-looking, but if I've got some data now, that will make
it easier for me to make the case that this could create some meaningful savings inside
my team or inside my organization.
Yeah.
Slurm normally has some accounting built in that logs when jobs ran, so that would give
you the data that you'd need to go and do that.
I think there's also a simulator available, which is something we want to look into in
our next phase as to whether we can make use of that to do that kind of prediction, or
not prediction analysis.
I saw your talk last year, and that was actually some of the inspiration behind this.
I think that's it.
One more.
Hi.
I was wondering, we're also in the purchase of buy-in new cluster, and we have also raised
this thing with the users.
Turns out they don't really like this.
The idea that there would be free resources in the cluster, and their job would not start
immediately.
Is this a cluster with people actually using the HPC clusters?
So far, it's very hard.
We're finding that from users, there's sort of a three-way split of some who care enough
to definitely go out their way to do things, some who might use it if it was available,
but aren't going to go out their way, and some who they want their science now, and
they don't care what the carbon emissions are.
I guess we can target the other two groups first, and the other third might get dragged
out, kicking and screaming one day when carbon accounting comes in for them.
Okay, can we thank the speakers again, please?