Okay, now we have David from the Alan Turing Institute who's going to tell us about RC tab. Thanks very much. Yeah, I'm David Llewellyn-Jones. I'm very happy to be here at the HPC, Big Data and Data Science Dev Room. I'm relatively new to the HPC area, so I'm still getting used to all of the terminology and everything involved. It's a very exciting area, so it's nice to be part of the community. So as it says, I'm from the Alan Turing Institute. That's the UK's National Institute of Data Science and AI, and I'll talk a little bit about what the Institute does and the relationship with RC tab. RC tab is a cloud management system that's been developed at the Institute. I'm going to talk a little bit about that, and then I'll also talk about the process that was gone through to turn it from an internal project into an open source project. I think there's some interesting things to be learned from that. So I'm part of the research computing platform teams at the Institute, and I work with various other people in the team. So Thomas Lasauskas, Instentson, SA Ben and Joe Palmer, alongside myself, are the current team, and I should say that all of the people in that list, apart from myself, are RC tab developers. I'm just an RC tab user. I use it in my day-to-day work. But there are some previous people on the team. Oscar, Giles and Pam Rochner were also developers of RC tab, so they also contributed to the work that I'm talking about here. So first of all, a little bit about the background to RC tab. So I'm part of the research computing team, as I mentioned. The research computing team essentially supports other researchers in the Institute doing research on compute systems. So that could be high-performance computing systems, or it could be cloud systems. The Institute itself doesn't run its own HPC clusters, but it does have relationships with a number of other clusters in the UK. So for example, it has relationships with Baskerville, with J2, with Archer 2 and with Dawn, which is the new Intel-based system, and who are here today, I guess, represented. Yeah, that's nice. So we basically are the interface to those other systems. And as well as those high-performance computing systems, we are also the interface to the Azure cloud that researchers are also keen to use as well. Not everything that our researchers want to do needs high-performance computing. They might want other sorts of systems. For example, they might just want to run a website, for example. And in that case, a high-performance computing system wouldn't be appropriate. Azure is something that would be more appropriate there. So yeah, the Alan Turing Institute itself is not that large, but we do have projects with over 400 researchers. So it is quite a large number of projects that we're managing. And so therefore, managing those cloud projects, the interface with the cloud, can be quite challenging. So why is the cloud useful in research? Well, researchers are very keen to use the flexibility of the cloud. They don't always need to use GPUs, but even when they do want to use GPUs, they might also want to do it on the cloud. A lot of the cloud providers are now moving into the same sort of area that the HPC systems are providing. The cloud provides a lot of flexibility. And in addition to that, it's also quite attractive for users that don't necessarily have a Linux background, that don't have a background of using a Qs, for example, to deploy their jobs. A lot of our users, a lot of our researchers are very keen to use cloud systems. However, there are downsides to that. And in particular, one of the problems that we find is that with cloud systems, the companies, Microsoft or Amazon or whoever is running the cloud service, aren't very keen to put restrictions on the costs that people can accumulate whilst using the cloud. So you can use cloud services and you can have them on your project and you can use them as much as you like. There won't be a cap. If you go over your funding limit that your project has assigned to you, you can just carry on using them. And actually preventing projects from going over their spending limits is a real challenge. So these clouds are very flexible, but they also provide a lot of scope to shoot yourself in the foot as a researcher using more resources than you should be able to use, than your project funding allows, for example. So there are certain ways you can manage that. You could, for example, assign a credit card to each of the projects, have a credit card associated with each of the projects that only has a certain amount of funding on it. And when that gets used up, then Microsoft will cut off your access to the service. But that's not really practical in a research institute that has tens, hundreds of projects. Managing that process is really hard. Azure and AWS will send you alerts when you get certain limits. They won't cut you off, but they will tell you when you get certain alerts. And so you could use that as a mechanism, for example, if you wanted to manage it in a sort of a manual way. But again, that's not really practical in a very large institute. So there really needs to be a way for project managers, people who are working, for example, in the research computing group, to manage those projects in an automated way. And that's essentially what our CTab is for. So I'll talk a little bit now about how Azure structures its resources. I'm sure many of you will be familiar with this, but in case you're not, Azure structures resources underneath subscriptions. And they're a natural way, essentially, to manage project access to Azure resources. So you can fire up virtual machines, you can fire up firewalls, you can fire up instances or other web services, for example. And they will all fall underneath a particular subscription. In addition to that, you can also put access control on those subscriptions. So only certain researchers would have access to those subscriptions, but once they've got access to the subscription, they can then fire up resources under those subscriptions. So they're a very natural way to manage projects, essentially. However, as I said, although you can put funding caps on these subscriptions, they won't get turned off when they reach their limit. So these are essentially the mechanisms that we use with our CTab to manage the subscription to manage the resources is via these subscription entities. So our CTab is a tool that has been developed in-house to essentially perform this process. So it tracks the spending that's happening on Azure for individual subscriptions, which are related to individual projects. It manages the budgets that are associated with them so that people can see how much they're using and they can track that resource usage. It will notify users of various points in the process of using that spending, but then crucially, which is the thing that we really need, it will also deactivate subscriptions when the money has been used up. So it will prevent researchers, it will prevent projects from going over their budget limits for their resources. And it will do this on a very large scale. So I'm going to talk a little bit about how it's structured. So this is the background to it. So it is open source. You can go to the GitHub repositories and get access to the code. You can deploy it yourself. And it is deployed itself on Azure, so it is an Azure service. It's split up into essentially four pieces. So there is the infrastructure repository that handles the deployment of the other services to the cloud, to the Azure instance. There's the CLI, the command line interface. The command line interface allows you to manage the subscriptions. So that is essentially the right access to the backend database that manages these things. Then we have functions, which I'll talk about a little bit more detail later, but the functions essentially are the cron-like jobs that run in the background, which manage access, which monitor resource usage, and also perform the job of turning off the resources. And then there is also an API that can be used to create a web interface for access to those resources as well. And the functions in the API are all deployed as Docker images. So if you wanted to use this yourself in your own organization, you could pull those images directly from Docker Hub and deploy them. And it's quite a straightforward process. And so until I search for it, I did not know what this image at the top meant, but that is a webhook image. All of these things are managed by GitHub webhooks essentially. So when we change the code, these things get deployed automatically to Azure. It's a very smooth process. So I'm going to show you now a little bit about how the interface looks. So this is the web interface that's provided by RCTAB. And you can see that there's essentially a list of subscriptions that you can see on the left-hand side. And each of those subscriptions has data associated with it. So it has the name. The name is pulled from Azure. The budget that's associated with it, which doesn't come from Azure, that is allocated by us using the command line interface. So we can allocate budget. And then the usage is monitored in the background. And when the remaining budget, which you can see on the far right-hand side, reaches zero, then the instance will be shut down. The subscriptions will be shut down. And all of the resources will essentially go into zero cost. And you can see that in this case here, there are a number of subscriptions that have reached zero. And they've been terminated. As you can see, they've got, I think it's a little skull and crossbones on the left-hand side to show that they've been terminated. And you can drill down into this. So you can select one of the subscriptions to find out more information. And then you get a page like this. And here, again, you can see not only the costs associated with it, but you can also associate costs with project funding and with a ticket. So we use a ticketing system so we can manage all of those things really, really cleanly and nicely. And it's not just us and the research compute team that have access to this. It's also the researchers themselves that have access to this interface. And because the access control list can be pulled from Azure into RCTAB, we can ensure that only the researchers that need access, sorry, only the researchers that have access to particular subscriptions can gain access to the information about those subscriptions. And I don't show it here, but you can drill down further. There are graphs that you can see about your usage and what your costs are and what your costs are likely to be as well. So it helps the researchers to manage their budgets that they have on Azure. So that's the web interface. The web interface is entirely read-only. So that's for looking at the information. You don't get the ability to change any of the information through this interface. If you want to change the information, you have to use the command line interface, which looks, some of it looks like this. It has quite a lot of functionality. This is the subscription functionality that I've just put the help up for here. And as you can see, I've also put a command on the bottom which shows how we allocate the additional budget. And this is essentially the tool that I'm using day to day. So I'm basically, when I get a ticket, I see how much funding is needed, is being allocated to a project. I check everything's okay, and then I run the command, and the command then allocates money to the project or removes money to the project. And we can also set end dates for projects as well. So the subscriptions would also get shut off if the end date is reached. So I'll just talk a little bit about how the functionality works in the background. It's a very high level. But you can see that we've got the websites, the CLI, that are both on the right-hand side there, the interfaces to the system. There is also email integration. So we use send grid, which is a third-party add-on to Azure to send out emails to users to tell them that their subscription has been terminated or that they're reaching nearing the end of their subscription, close to the termination period. And there's a database in the middle, which captures all the information, as you might expect. But then there are these three functions. So there's the status function, the usage function, and the controller function, which run their cron-like jobs that run in the background. So they actually run periodically. They run every hour, in fact. And the status function is monitoring the status of a subscription. It will send out emails to users to tell them that their subscription has been either activated or deactivated. The usage function measures the cost that has accumulated on a subscription. So again, it will also send out emails to users, but it then also feeds that information into the database, which is used by the controller function to ultimately shut down the subscription when the remaining usage has been used up. So all of these things are running in the background. And they're essentially providing the services that you might think that would be available on Azure already to do this, but I guess it's not in Microsoft's interest to do so. So we also have this group management structure that is used for this, as I mentioned. All of this is deployed to Azure. So the actual RCTAB infrastructure is deployed to Azure as well. And you have to manage this quite carefully in an institute like the Turing, because we have a lot of projects. Some of them have a lot of private data and those things, because in effect RCTAB has access to subscriptions, we have to make sure that those are not monitored by RCTAB. So we have a managed group that is for special case subscriptions that you can see on the far right hand side for you. And they are not managed by RCTAB. Even within RCTAB, we also have two groups. We have managed and not managed subscriptions, which are both monitored so that users can find out the information. But there are certain subscriptions that we don't want to shut down, even if they go over budget. One example of that is RCTAB itself. It's very important that RCTAB doesn't shut itself down. That would lead to bad consequences. So there are certain things that we don't want to shut down and we keep those separately as well. All right. I'll talk briefly now about the process of open sourcing. So originally RCTAB was an internal project. So we've been using it for, I guess, since 2020. So between 2020 and 2023, it was an internal project. It wasn't open source. It was essentially closed source in some private GitHub repositories. And it was switched to an open repository around April 2023. So it's now been available as open source for a little under a year. And the process of open sourcing was actually quite time consuming. So we had to go through full code audit. We wanted to remove, so again, I'm not a developer, so I was only adjacent to the process. But there was a lot of work that went into it, essentially making the code better, checking that there weren't secrets within the code, for example, which is one of the problems which happens when you've been running an internal project for a long time. And in fact, one of the motivations for keeping it closed source originally was to try and avoid any, reduce the risk essentially of potentially, if you push a secret to a repository, it might become exposed. So there was a lot of work using Gitleaks for checking for these secrets, Grep, also for the same reason, a lot of grepping to check for the stuff, and Vulture for unused code. And ultimately, I think, as it says there, 25% of the code was actually removed, 25% of it was refactored. It was quite a lot of work to do that. And there was a process of moving from one repository to another. And in fact, that meant running two deployments of RCTAB simultaneously for a period of time. The choice was used to use Pulumi as infrastructure as a code for deploying RCTAB. And that was also part of the open sourcing process, because there was a need to make the deployment a lot easier if other people were going to be able to use it. And Pulumi was chosen essentially because it's open source, because it's using Python, in contrast to examples like Terraform, which are closed source and which have their own declarative language, RCTAB itself is written in Python. Pulumi seemed like a natural choice. And it seems to work really well. So if you want to deploy RCTAB now, you literally just have to run Pulumi up, and it will deploy all of these things to Azure in a very sleek and streamlined process. So it actually works really nicely. So the Alan Turing Institute itself is, in theory, committed to open source code. One of the things that we champion that we promote is the idea that you should have reproducible research, and for that you have to have open source code in the main. So you'd reasonably ask the question, why wasn't it open source to begin with? Well, I already mentioned one of the reasons. It was to reduce risks initially. It was also because it was never really perceived as being something that would be needed by other organizations. But over time, it became clear that actually other organizations could also really benefit from that functionality. And finally, there was also the fact that this process of refactoring the code was something that was ongoing anyway. So actually converting it to open source was both a motivation for cleaning up the code, but also motivated by the fact that that was happening anyway. And ultimately, it reached a stage where it was generalizable enough and also potentially beneficial enough to other organizations that it seemed like the natural thing to do. But there wasn't really originally any good commercial reason for keeping it closed source. But in practice, that's probably something that should have been done from the start. All right. So in conclusion, well, managing costs on cloud platforms like Azure is difficult for large organizations that have a lot of projects. RCTAB provides now a very stable and battle-worn solution. We've been using it for three years with many different projects. It seems to work very well, at least for our needs. And we really expected that there would be something else out there that would do this job. It's not really the natural thing for people to want to do. But in practice, we just didn't find anything that fitted the bill. And then finally, this process of open sourcing, it was time-consuming, but ultimately, it felt like it was a very beneficial process to go through and the code has benefited as part of that. All right. Thanks very much for listening. Thanks, David. Do we have any questions? Hi. So when a project runs over the budget, you shut down the instances to reduce the cost to zero. Well, so the subscription is deactivated and the consequence is the same. Okay. And what happens with storage, for example, storage accounts like Azure Archive storage or like Blob storage? Yeah. So the storage is retained because that's generally very low cost. Okay. But they continue producing costs, right, if the storage is retained? So there is a small cost to that, yeah, exactly. Thanks. Any other questions? Hi, David. Hi. Just a question off the cuff. How tightly coupled are you to the Azure API? I'm so sorry. I couldn't hear properly. How tightly coupled is this code to the Azure API? So the question is how tightly coupled is it to the Azure API? Yeah. So that's a really good question. So at the moment... Can people be a bit quiet, please? So like I said, I'm not an RCTab developer, so take this with a level of uncertainty. My understanding is that conceptually, it's not strictly speaking tied, so the concepts will transfer over, but practically speaking, it makes heavy use of the Azure API. Yeah. So at the moment, it would be possible to convert it to something like AWS, say, but practically speaking, that's as I understand it, quite a lot of work at the present stage. Yeah. Are you using something else? Or do you think there's a practical benefit to that? Yeah. Yeah. Okay. Thanks. Thank you. Hey. So first I want to say thank you for the talk. It was kind of a little bit, I don't know, annoying that they were just coming in and just interrupting your talk. I find it was rude and I wanted to say that. And the other thing I want to say or ask is that, so you say it might be not in Microsoft best interest to have like a so delicate model of managing the cost for these kind of projects. I think that might be true. But I was wondering, you have a quite high complicated structure of like managing the budgets for projects. I think that's something, yeah, special I guess. Is it? So your question is, so you're saying that, yeah, my question is like, that it's an unusual situation. Yeah, exactly. Other universities might have the same thing, but not other places. Yeah. I mean, that's a really good, I think that's a good question. So we do have a lot of subscriptions, but I think we're probably not unique. So you mentioned universities. I think other universities will probably be in a similar situation. I think it's, and I guess it is also true that the nature of the projects we do are that they come with defined funding. So they can't, if they go over their budgets, that's a problem. And that may not be the case in a lot of other organizations. So you're right. I think it probably is a special case, but I'm sure there are, I would expect there to be other organizations that fall into this situation as well. Yeah. Okay. Thank you for answering. Yeah. Thanks very much for the question. Very quickly, the questioner said that it might not be in Microsoft's interest to help you limit your budget. That's, if you work for Amazon, you are actively rewarded to save your customers money. And I'm sure that's the same for Azure. They like to hang on to the customers and they can help you save your money. You are rewarded for that company. Okay. And I guess I do want to make clear that, yeah, Microsoft provides us a very good service, I have to say. I don't want to give the impression that I'm putting down Microsoft by saying that. I guess that was a mistake to suggest that. However, it does seem to be the case. I guess I should put it in a slightly different light. The nature of cloud is that people want access to the resources when they need them and they don't expect them to be shut down. That's not the nature of how people perceive the cloud, I guess. So in that sense, it's probably not in Microsoft's interests because that in some sense, to me, that feels like it's not the service that they're providing. They're providing access to resources and they almost make it feel like it's unlimited access to resources, even though obviously it's not in reality. But in practice, that's kind of the service they're offering. Is that, would you disagree? Sorry. Okay. Clear enough. Okay. I think we have to stop now. So let's thank David again.